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0 A parallel array processor tor massively paranel applications is formed wrih tow power CMOS with DRAM 
processing white incorporating processing olamente on a single chip. Eight processors on a single chip have 

a Choir own associated processing element, significant memory, and I/O and are interconnected with a hypeicube 
based, but modified, topology. These nodea are then interconnected, either by a hypercube, modified hyper- 
01 cube, or ring, or ring wHh'm ring network topology. Conventional microprocessor MMPs consume pins and time 
C«4 going to memory. The new architecture merges processor and memory with multiple PMEs (eight 16 bit 
Ps processors with 32K and i/O) in DRAM and has no memory access delays and uses all the pins for networking, 
q The chip can be a single node of a fine-grained parallel processor Each chip will have eight 16 bit processors, 
fN each piQoessor providing 5 mips performance* I/O has three Internal porta and one external port shared by the 
W piural processors on the chip. Significant software ffexibiltty is provided to enatte quick implementation of 
O existing programs written In common languages. II la a developable and expandable technology without need to 
ft develop new pinoufa. new software, or new utilities as chip density increases and new hardware is provided for a 
gj chip function. The scalable chip PME hea internal end external connections for broadcast and asynchronous 
SIMD, M1MD and SlMIMD (SlMD/MIMD) wim dynamic switching erf modes. The chip can be used in systems 
which employ 32, 84 or 128*000 processor* and can be used for lower, intermediate and higher ranges- Local 
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and global memory functions can all do provided by We chips themsebes. and the system can connect to and 
support other global memories and DASO. The chip can be used as a microprocessor accelerator, in personal 
computer applications, as a vision or avionics computer system, or as workstation or supercomputer. There is 
proeram compatibiaty for (he «ully scalable 8-ystem. 
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FIELD OF THE INVENTIONS 

The invention 'Slates to computet and compute* systems and particularly to parallel array processor. 
In accordance wHh the invention, a parallel array processor <APAP) may be incorporated on a single 
a semiconductor silicon chip. This chip forms a basis for the systems described which are capable ol 
mase'rvety parallel processing or complex scientific and business applications. 

REFERENCES USED IN THE DISCUSSION OF THE INVENTIONS 

10 In the detailed discussion of Ihe invention, other works will be referenced, inducting references to our 
own unpublished works which are not Prior Ar?» which will akj the reader in following the discussion. 

GLOSSARY OF TERMS 

)s a ALU 

ALU is the arithmetic logic unit portion of a processor. 
0 Array 

Array refers to an arrangement m elements in one or more dimensions. An array can include an 
ordered set of date lisrns (array element) which in languages Jfea Fortran are Identified by a single 

so name. In other languages such a name of an ordered set of data items refers to an ordered collection 
or set of data elements, ail of which have identical attributes. A program array has dimensions 
specified, generally by a number or dimension aitribute. The declarator of Die array may also specify 
Ihe size of each dimension of the array in some languages, in some languages, an array is en 
arrangement of elements in a table. In a hardware sense, an array is a collection of structures 

2$ (functional elements] which are generally identical in a massively parallel architecture. Array elements 

in date parallel computing are stemente which can be assign&d operations and when parallel can each 
independently and In parallel execute the operations required. Generally, arrays may be though! of as 
grids of processing elements. Sections of the array may be assigned sectional data, so that sectional 
data can be moved around in a regular grid pettem However, dam can be indexed or assigned to an 

so arbitrary location in an array, 

o Array Director 

An Array Director is s unit programmed as a control lor far an array. It performs the function of a 
master controller for a grouping of Functional elements arranged in an array, 
o Array Processor 

as There two principal types of array processors • multiple instruction multiple data (MIMD) and single 

instruction multiple dale (SiMD). in a mimd array processor, each processing element in the array 
executes its own unique instruction stream witfi its own data. In a SIMD array processor, each 
processing element in the array is restricted to the same instruction via a common instruction stream; 
however the data associated with each processing element Is unique. Our pief erred array processor 

*o has other characterisScs. We call r Advanced Parallel Array Processor, and use the acronym APAP, 
o Asynchronous 

Asynchronous is without a regular time relationship; the execution of a function is unpredictable with 
respect to the execution of other Junctions which occur without a regular or predictable flrna 
relationship to other function executions. In control situations, a controller will address a location to 
4$ which control is passed when data is waiting for an idle element being addressed- This permits 
operations to remain in a sequence while they are out of time coincidence wrth any event 
o BOPS>GOP8 

BOPS or GOPS ere acronyms having the same meaning - billions of operations per second. See 
dOPS, 

so o Circuit Switched/Store Forward 

These terms refer to two mechanisms tor moving data packets through a twtwork of nodes. Store 
Forward is a mechanism whereby a data packet is received by each intermediate node, stored into Its 
memory, and (hen forwarded on towards its destination. Circuit Switch Is a mechanism whereby an 
intermediate node o commanded to logically connect «« input port to an output port such that data 

so packets can pass directly through the node towards their destination, without entering the intermediate 

node's memory, 
o Cluster 

A cluster is a station (or functional unit) which consists of a control unit (cluster controller) and the 
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hardware (which may be terminals, functional units, or virtual components) attached to it. Our Cluster 
includes an array of PMEs sometimes called a Node array. Usually a duster has 512 PMEa. 

Our Entire PME node array consists of a set of dusters, each cluster supported by a cluster 
controller (0C), 
Cluster controller 

A duster comroto is a device thai controls input'output {t-O) operations for mora man one devioe or 
functional unit connected to it A cluster corrarotfer is usually comroJIed by a program stored and 
executed In the unit a$ « was In the IBM 3301 Finance Communication Controller but It can be 
entirely controlled by hardware as it was in the IBM 3272 Control Unit 
Cluster synchronic 

A duster synchronizer is a functional unit which manages me operations of ail or part of a duster to 
maintain synchronous operation ot the elements so that the tuncHonal units maintain a particular time 
relationship with the execution of a program. 
Controller 

A controller is a device that directs the transmission of data and instructions over the links of an 
interconnection network; Its operation Is controlled by a program executed by a processor to which 
the controller is connected or by a prOQram executed within the device. 
CMOS 

CMOS Is an acronym tor Complementary Metal-Oxide Semiconductor technology. It is commonly 
used to manufacture dynamic random access memories {DRAMs). nmos is another technology used 
to manufacture DRAMS, We prefer CMOS but the technology used to manufacture the APAP is noi 
intended to limit the scope of the semiconductor technology which Is employed. 
Dotting 

Dotting refers to the joining of three or mora leads by physically connecting them together. Most 
backpaneJ busses share this connection approach. The term relate? to OR DOTS of times past but Is 
used here to identity multiple data sources that can be combined onto a bus by a very simple 
protocol. 

Our I/O zlppet concept can bo used to Implement the concept that the port into a node could bo 
driven by the port out of a node or by data coming from the system bus. Conversely, data being put 
out of a node would be avaflable to both the input to another node and to the system bus. Note lhat 
outpuiting data to both the system bus and another node is not done simultaneously but in different 
cydes. 

Bolting Ib used In the H-DOT discussions where Two-ported PEs or PMUs or Pickets can be used 
in arrays of various organizations by tak«>g advantage of dotting. Several topologies are discussed 
including 20 and 3D Meshes. Base 2 M-cubo, Sparse Base 4 N-cube, and Sparse Base 8 N^cube. 
DRAM 

DRAM is an acronym for dynamic random access memory, tl» common storage used by computers 
for main memory. However, the term DRAM can be applied to use as a cache or as a memory which 
is not the main memory. 
FLOATING-POINT 

A floating-point number is expressed in two parts. There is a fixed point or fraction pari, and an 
exponent part to some assumed radix or Base. The exponent Indicates the actual placement of the 
decimal point. In the typical floating-point representation a real number 0,0001234 is represented as 
0.1234-3. where 0.1234 is the fixed-point part and O is the exponent. In this example, the floating- 
point radix or base is 10, where 10 represents the implicit fixed poerSve Integer base* greater than 
unity, that Is raised to the power explicitly denoted by the exponent m ths floating-point representation 
or represented by the characteristic in the floating- point representation and (hen multipBed by the 
fixed-point part to determine the real number represented. Numeric literals can be expressed in 
floating-point notation as well as real numbers, 
FLOPS 

This terms refers to noating-point instructions per second. Floating-point operations include add. 
SUB, MPY, DiY and often many others. Floating-point instructions per second paramerar is often 
calculated using the add or multiply instructions and. in general* may be consldened to have a 50/50 
mix. An ope/aeon includes the generation ot exponent, fraction and any required fraction normaliza- 
tion. We could address 32 or 4&-bit floating-point formats (or longer but we have not counted them in 
the mix.) A floating-point operation when implemented with fixed point Instructions {normal o* RISC) 
requires multiple instructions. Some use a 10 to 1 ratio in figuring performance while some specific 
studies have shown a ratio of 8.25 more appropriate to use. Various architectures will have different 
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ratios. 
o FuncflonaJ unit 

A Junctional unit is an entity of hardware, software, or both, capable of accomplishing a purpose, 
o Gbytes 

s Gbytes refers to a oft'ioo bytes. Gbytee/s would be a billion bytes per second, 

o GIGAFLOPS 

(lOrS floating-point Instructions per second, 
o GOPSand PETAOP9 

GOP6 or BOPS> have frie same meaning - billions of operations per second. PETAOPS means 
70 trillions of operations per second , a potential of the current machine. For our APAP machine fray are 
Just about the same as BIPsGrPs meaning billions of instructions per second. In some machines an 
Instruction can cause two or more operations (le. both an add and multiply) but we don't do that. 
Alternatively it ooutd take many instructions to do an op. For e*ampte we use multiple instructions to 
perform 64 bit arithmetic- In counting ops however, we did not etect to count log ops. GOPS may be 
*a the preferred use to describe performance, but there Is no consistency In usage that has been noted. 

One sees MFsMOFs then BIFs/BOPs end MegaFLOP&'GigaFLOPSrTeraFLOP&PetaFiope. 
o ISA 

ISA means the Instruction Set Architecture, 
o Link 

to A [ink is an element which may be physical or logical. A physical link Is the physical connection tor 

Joining elements or units, while in computer programming a link is an Instruction or address that 
passes control and parameters between separate portions of the program, m multisyetems a link is 
(he connection between two systems which may be* specified by program code identifying the link 
which may be identified by a reel or viriual address. Thus generally a link includes me physical 

n medium, any protocol, and associated devices and programming; it is both logical and physical, 

o MFLOPS 

MFLOPS means (10)^6 floating-point instructions per second. 

o mimd 

mimd is used to refer to a processor array architecture wherein each processor in the array has its 
90 own instruction stream, thus Multiple Instruction stream, to execute Multiple Data streams located one 
per processing element, 
o Module 

A module Is a program unit that is discrete and identifiable or a functional unit of hardware designed 
tor use with other components. Also, a collection of PCs contained in a single electronic chip is called 

» a module, 
o Node 

Generally, a node is the junction of links. In a generic array oi PEs, one PE can be a node. A node 
can also contain a collection of PEs called a module, m accordance wqh our invention a node is 
formed of an array of PMEs, and we refer to the set of PMEs as a node. Preferably a node is 8 PMEs, 
40 o Node array 

A collection of modules made up of PMEs i$ sometimes referred to as a node a/ray, is an away of 
nodes made up of modules. A node array is usually more than a few PMEs, but the term 
encompasses a plurality, 
o PDE 

<6 A PDE is a partial differential equation, 

o PDE relaxation solution process 

POE relaxation solution process cs a way to solve a POE (partial differential equation*. Solving PDEe 
uses most of the super computing compute power in the known universe and can therefore be a good 
example of me relaxation process. There are many ways to solve the pde equation and more than 

so one of the numerical methods includes the relaxation process. For example, if a PDE is solved by 
finite element metnods relaxation consumes the bulk ot the computing time. Consider an example 
from the world ot heat transfer. Given hot gas inside a chimney and a cold wind outside, how wiP the 
temperature gradient within the chimney bricks develop? By considering the bricks as tiny segments 
and writing an equation that says how heal flows between segments as a function of temperature 

ss differences then the heat transfer PDE has been converted into a finite element problem. H we then 
say all elements except those on the Inside and outside are at room temperature while the boundary 
segments are at (he hot gas and cold vrfnd temperature, we have set up the problem to begin 
relaxation. The computer program then models time by updating the temperature variable in each 
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segment based upon the amount of heat that flows into or out of the segment K takes many cycles of 
processing el the segments In the model before the eel of temperature variables acrose the chimney 
relaxes to represent actual temperature distribution thai would occur in the physical chimney. » the 
objecfive was to mods! gas cooling In me chimney thon me etemente would have to extend to gas 

s equations, and the boundary conditions on the Inside would be linked to another finite element model, 

and tns process conSnues. Note that ma heel flow is dependent upon the temperature difference 
between the segroorn and Its neighbors. It thus uses me inter-PE communk^on paths to distribute 
the temperature variables. It Is this near neighbor communication pattern or characteristic that makes 
PDE relation very applicable to parallel computing. 

70 o PICKET 

This is the element in an array of dements making up an array processor, it consists of: date flow 
<ALU REGS), memory, control, and ttt* portion ot tne communication matrix associated with the 
element. The unit refers to a 1/nih of an array processor made up of parallel processor and memory 
elements wtih their control and portion erf the array Intercommunication mechanism- A picket Is a form 

J3 of processor memory element or PME. Our PME chip design processor logic can implement the 

picket logic described in related applications or I rave the logic for the array of processors formed as a 
node. The term PICKET is similar to the commonly used array term PE for processing element and 
Is an element of the processing array preferably oomprleod of a combined processing element and 
local memory for processing bit parallel bytes of Information In a dock cycle. The preferred 

20 embodiment consisting of a byte wide data flow processor, 32k bytes or more of memory, primitive 
controls and ties to communications with other pickets. 

The term "picker comes from Tom Sawyer and his white fence, although it will also be 
understood functionally that a military picket line analogy fits quite well, 
o Picket Chip 

29 A picket chip contains a plurality of pickets on a single silicon chip, 
o Picket Processor system (or Subsystem) 

A picket processor Is a iota! system consisting of an array of pickets, a communication network, an 
VO system, and a SIMD controller consisting of a microprocessor, a canned routine processor, and a 
micro-controller that runs the array. 

30 o Picket Architecture 

The Picket Architecture i9 the preferred embocfiment tor the SIMD architecture wrih features mat 
accommodate several diverse kinds of problems including: 

- eel associative processing 

- parallel numerically intensive processing 

as * physical array processing similar to images 

o Picket Array 

A picket array *e a collection of pickets arranged in a geometric order, a regular array, 
o PME or processor memory element 

PME is used toe a processor memory element. We use the term PME to refer to a singfe processor. 

<« memory and W capable system element or unit thai forms one of our parallel array processors. A 
processor memory element is a term which encompasses a picket A processor memory element is 
1/ntb of a processor array whlcfi comprises a processor, rts associated memory, control Interface, and 
a poriion of an me^ communication network mechanism. This element can have a processor memory 
element with a connectivity of a regular array, as in a picket processor, or as part of a subarray, as in 

<s the multi-processor memory element node **e have described, 
o Routing 

Routing is the assignment of a physical path by which a message will reach its destination Routing 
assignments have a source or origin and a destination. These elements or addresses have a 
temporary relationship or efnnfty. Oiten, message routing is based upon a key which is obtained by 
so reference to a table of assignments. In a network, a destination is any station or network addressable 

unit addressed as the desrin&rion of information transmitted by a path control address mat identities 
the link. The destination field identities the destination with a message header destination code, 
o SIMD 

A processor array architecture wherein au processors in the array are commanded from a Single 
so Instruction stream to execute Multiple Data streams located one per processing element. 

0 $IMDMIMD or SIMD/MIMD 

SIMDMIMD or SIMD/MIMD is a ierm referring to a machine that has a dual function that can switch 
from MIMD to SiMO for a period of time to handle some complex Instruction, and ehus has two 
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modes. The Thinking Machines, Inc. Connection Machine model CM-2 when placed as a front end or 
bsch end of a MIMD machine permitted programmers to operate different modee for execution of 
different parte of a pmbfem. referred to sometimes a dual modes* These machines have existed since 
Uliac and have employed a bus that interconnects the master CPU with other processors. The master 
a control processor would have the capability of interrupting the processing of other CPUs. The other 

CPUs could run independent program code. During an interruption, some provision must be made for 
checkpointing (closing and saving current statue of the controlled processors), 
o SIMIMD 

SI Ml MO is a processor array architecture wherein ail processors In the array are commanded from a 
io Single Instruction stream, lo execute Multiple Data streams located one per processing element. 

WKnln this construct data dependent operations within each picket that mimic instruction execution 

are controlled by the SIMD instruction stream. 

This is a Single Inslsuction Stream machine with the ability to sequence Multiple Instruction 

streams (one per Picket) using the SlMD instruction stream and operate on Multiple Data Streams 
}5 (one per Picket). SIMIMD can be executed by a processor memory etemerri system. 

SISD 

SISD is an acronym for Single Instruction Single Data. 
20 o Swapping 

Swapping interchanges the data content of a storage area with that of another area of storage. 

o Synchronous Operation 

Synchronous operation in a MIMD machine is a mode of operation in which each action is related to 
an event {usually a clock): it can be a specked event that occurs regularly in a program sequence. An 
24 operation Is dispatched to a number of PEs who then go oft to independently perform the function. 
Control is not returned to the controller until the operation is completed. If the request is to an array of 
functional unite, the request is generated by a controller eo elements in the array which must complete 
their operation before control Is returned to the controller. 

0 TERAFLOPS 

so TERAFLOPS means (10rt2 floatlng~BotnL Instructions per second, 

o VLSI 

VLSI is an acronym for very large scale integration {as applied to integrated circuits), 
o Zipper 

A zipper is a new function provided. ft allows for links to be made from devices which are external to 
is the normal interconnection of an array configuration. 

BACKGROUND OF THE INVENTION 

In the never ending quest for faster computers, engineers are linking hundreds, and even thousands of 
40 low coat microprocessors together in parallel to create super supercomputers that divide in order to conquer 

complex problems that Btump today's machines. Such machines are called massivery parallel- We have 
created a new way to create massively parallel systems. The many improvements union we have made 
should be considered against the background of many worfca of others. 

Multiple computers operating in parallel have existed for decades. Early parallel machines included the 
<s illiac which was started In the 1980s. ILUAC IV was built in the 1970s. Other multiple processors Include 
(see a partial summary in U.3- Patent 4 ( £75,8$4 issued December 4, 1AQ0 to Xu et al) the Cedar, Srgma*i, 
the Butterfly and the Monarch, the Intel ipse. The Connection Machines, the CaJtech COSMIC, the N Cube, 
IBM's RP3v IBM's <3Fi1, the NYU Ultra Computer, the Intel Defta and Touchstone. 

Large multiple processors beginning with ILLIAC have been considered supercomputers. Supercom- 
so puters with greatest commercial success have been based upon multiple vector processors, represented by 
tha Cray Research Y-MP systems, me IBM 3080. and other manufacturers machines including those of 
Amdahl, Hitachi. Fujitsu, and NEC 

Massively Parallel Processors (MPPs) ate now thought of as capable of becoming supercomputer 
These computer systems eggregate a large number of microprocessors with an interconnection network 
38 and program them to operate in parallel. There have been two modes of operation of ihese computers. 
Some of these machines have been MIMD mode machines. 

Some of these machines have been SlMD mode machines. Perhaps the most commercially acclaimed 
of these machines has been the Connection Machines series 1 and 2 of Thinking Machines, Inc . These 
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have been essentially SIMD machines. Many of the* massively parallel machines havs used microprocessors 
Interconnected In parallel to obtain their concurrency or parallel opera! tons capability. Intel microprooesaora 
Rkd i860 iwve been used by Intel ami others. N Cube had made such machines with Intel '386 
microprocessors. Other machines have been bum with what is called the "transputer" chip. Inmos 

s Transputer IMS T800 Is an example. The Inmos Transputer T800 Is a 32 bit device wllh an Integral high 
speed floating point processor. 

As an example of the kind of systems that are bum, several Inmos Transputer T800 chips each would 
have 02 communication Br* Inputs and 32 link outputs. Each chip would have a single processor, a small 
amount of memory, and communication links to the local memory and to an external Interface- In addition, 

?t> In order to build up the system communication Jink adaptors fik© (MS C011 and O0J2 would be connected, 
in addition switches, like e IMS C004 would provide* say, a crossbar switch between the 32 link inputs and 
32 link outputs to provide point-to-point connection between additional transputer chips. In add-on, there 
will be special circuitry and interface chips for transputers adapting them to be used for a special purpose 
tailored to the requirements of a specific device, a graphics or disk controller. The Inmos IMS M212 Is a 19 

f 5 bit processor, with on chip memory and communication finks. It contains hardware and logic to control disk 
drives and can be used as a programmable disk controller or as a general purpose interface, in order lo use 
the concurrency {parallel operations} Inmos developed a special language, Occam, for the transputer. 
Programmers have to describe the network of transputers directly In en Occam program. 

Some of these massively parallel machines use parallel processor arrays of processor chips which are 

as> inrerconnected with different topologies. The transputer provides a crossbar" network with the addition of IMS 
C004 chips, dome other systems use a hypercube connection- Others use a bus or mesh to connect the 
microprocessors and there associated circuitry. Some have been interconnected by circuit switch proces- 
sors that use switches as processor addressable networks. Generally,, as with me 14 RtSC'tiOOCs whteh 
were interconnected last Tall at Lawrence Livermore by wiring the machines together, the processor 

23 addressable networks have been considered as coarse-grained multiprocessors* 

Some very large machines are being built by Intel and nCube and others to sitae* what are called 
"grand chsdenges N in data processing. However, these computers sre very expensive. Recent projected 
costs are In the order of $3O,Q0O»00O.OO to $75,000,000.00 (lera Computer) for computers whose 
development has been funded by me U.S. Government to attack the "grand challenges*. These "grand 

so challenges" would Include such problems as climate modeling, fluid turbulence, pollution dispersion, 
mapping of the human genome aiKl ocean circulation, quantum chrcmodynamics, semiconductor and 
supercomputer modeling, corobusson systems, vision and cognSion, 

As a footnote to our background, we should recognizee one of the early massively parallel machines 
developed by IBM. In our description we have chosen to use the term processor memory element rather 

33 than "transputer 11 to describe one of the eight or more memory units with processor and I/O capabilities 
which make up the array of PMEs in a chip, or node. The referenced prior art "transputer" has on a chip 
one processor, a Fortran coprocessor, and a small memory, wHh an I/O interface. Our processor memory 
olemenl could apply to a transputer and to the PME of the RP3 generally. However, as will bo recognized, 
our Bttle chip is significantly different In many respects. Our little chip has many features described later. 

do However, we do recognize that the term PME was first coined ior another, now more typical. PME which 
formed the basis for the massively parallel machine known as the RP3. The IBM Research Parallel 
Processing Prototype {RP3) was an experimental parallel processor based on a Multiple Instruction Multiple 
Oata (MIMD) architecture, RP3 was designed and built at I8M T.J. Watson Research Center in cooperation 
wilh the New York University Ultracompuier project. This work was sponsored in part by Defense Advanced 

<*s Research Project Agency. RP3 was comprised of 64 Processor-Memory Elements (PMEs) Interconnected 
by a high speed omega network. Each PME contained a 32-bit IBM *PC sdentific v microprocessor, 32-k8 
cache, a 4-MB segment of the system memory, and an I/O porl. The PME I/O port hardware and software 
supported initialization, status acquisition, as well as memory and processor communication through shared 
tO support Processors (l8Ps). Each ISP supports eight processor- memory elements through the Extended 

so I/O adapters (ETlOs), independent of the system network. Each ISP Interfaced to the IBM S/370 channel 
and the IBM Token-Ring network as wei) as providing operator monitor service. Each extended I/O adapter 
attached as a device to a PME ROMP Storage Channel <RSC) and provided programmable PME 
control/status signal I/O via the ETIO channel. The ETIQ channel Is the 32-b«t bus which interconnected the 
ISP to the eigm adapters. The ETIO channel relied on a custom interface protocol with was supported by 

53 hardware on the ETIO adapter and software on the ISP. 
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Problems addressed by our APAP machine* 

The machine which we have ceiled the Advanced Parallel Array Processor (APAP) is a fine-grained 
peraltel processor which we believe is needed to address issues of prior designs. As illustrated above, there 
have been many line-grained (and also coarse-grained) processors constructed From both point design and 
ofMhe-ahelf processors using dedicated and shared memory and any one of the many possible intercon- 
nection schemes. To daw these approaches have all encountered one or more design and performance 
limitations. Each "solution'' leads In a different direction. Each ha9 Hs problems. Existing parallel machines 
are difficult to program. Each is not generally adaptable to various sizes of machines compatible across a 
range of applications. Each has its design limitations caused by physical design, interconnection and 
architectural issues. 

Physical issues 

Some approaches utiUze a separate chip design for each or the various functions required in a 
horizontal structure. These approaches suffer performance limitations due to chip crossing delays, 

Oiher approaches Integrate various functions together vertically into a single chip- These approaches 
suffer performance limitations due to the physical Ilmft on the number of logic gates which can be 
integrated onto a producible chip. 

Interconnection Issues 

Networks which interconnect the various processing functions are important to fine-grained parallel 
processors. Processor designs with buses, meshes, and hypercubes have all been developed. Each of 
these networks has Inherent amrtettons as to processing capability. Buses limit both the number of 
processors which can be physically interconnected and the network performance. Meshss lead to large 
network diameters which limit network performance. Hypercubes require each node to have a large number 
of Interconnection ports; the mimbet of processors which can be interconnected is limited by the physical 
input'output pins at the node. Hypercubes are recognized as having some significant performance gains 
over the prior bus and mesh structures. 

Arch fteciu ret Issues: 

Processes which are suitable lor fine-grained parallel processors fall into two distinct types. Processes 
which are functionally partriionable tend to perform better on mutepl© instruction, mu&pte data <MIMO) 
architectures. Processes which are not functionaJly parUttonabie but have multiple data streams tend to 
perform baiter on single instruction, multiple data (SIMO) architectures. For any given appScation, there is 
flkery to be some number erf both types of processes. System trade-ofts an© required to pick the 
architecture which best suits a particular application but no single solution has been satisfactory. 

SUMMARY OF THE INVENTION 

We have created a new way to make massively parallel processors and other computer systems by 
creating a new "chip" and systems designed with our new concept This application is directed to such 
systems. Componenrs described in our applications can be combined in our systems to make new 
systems. They also can be combined wrth westing technology, 

Tlwtt, our IUtle CMOS DRAM chip of apprxaxtmately 14 x 14 mm can be put together much like bricks 
are walled in a building or paved to form a brick road. Our chip provides the structure necessary to build a 
"house - , a comptex computer system, by connected replication. 

Racing our development in perspective* four little chips, each one alike, each one with eight or more 
processors embedded in memory with an internal array capability and external I/O broadcast and control 
interface, would provide me memory and processing power of thircy-sfcx or more complex computers, and 
they could all be placed with compact hyMd packaging into something the size of a watch, and operated 
with very low power, as each chip only cfissipates about 2 wawa, with wis chip, we have created many new 
concepts, and Aoso ihat we consider our invention are described in detail in the description and claims. 
The systems tftat ceo be created with our computer system can range from small devices to massive 
machines with PETAOP potential. 



EP 0 570 729 A2 



Our little memory chip array processor we call our Advanced Parade* Array Processor. Though small, ri 
Is complex and powerful. A typical cluster will have many chips. 

Many aspect* and features of invention have been described in thiu and related applications. These 
concepts and features of invention improve and aro applicable to computer systems which may not employ 

s each invention. We believe our concepts end features will be adopted and used In the next century. 

This technical descripfion provides en overview of our Advanced Parallel Array Processor (APAP} 
representing our new memory concepts and our effort in developing a scalable massively parallel processor 
(MPP) that Is simple (very small number of unique part numbers) and has very high performance. Our 
processor utilizes in Its preferred embodiment a VLSI chip. The chip comprises 2n PME microcomputers, 

30 "n" represents the maximum number of array dimensionality. The chip further comprises a broadcast and 
control interlace (BCD and Internal and externa} oommunte^on paths between pmes on the chip among 
themselves and to the of-- chip system environment The preferred chip has 3 PMEs (but we also can 
provide more) and one BCL The 2n PMEs and DC I are considered a node. This node can function in either 
SIMD or MIMQ mode, in duel SIMQJMODE, wrfji asynchronous processing, and with SIMIMD functionality. 

>5 Since it is scalable, this approach provides a node which can be the man building block for scalable 
. parallel processors of varying size. The microcomputer architecture of the PME provides tally distributed 
message passing* interconnection and control features within each node, or chip. Each node provides 
multiple parallel microcomputer capability at the chip level, the microprocesor or personal computer level, at 
a workstation level, el special appllcaUon levels which may be represented by a vision and/or avionics lewel, 

20 end, when fully extended, to capability at greater levels with powerful Qigafiop performance into the 
supercom purer range. The simplicity is achieved by the use of a single highly extended DRAW Chip thei is 
replicated into parallel clusters. This keeps (he part number count down and allows scaling capability to (he 
cost or performance need, by varying me chip count, then «he number of modules, etc. 

Our eppcoech enables us lo provide e machine with etiributes meeting the requirements that drive lo a 

25 parallel solution In a sertes of applications. Our methods of parallellzatlon at the sub-chip level serve to keep 
weight, volume, and recurring and logistic costs down. 

Because our different size systems are all based upon a single chip, software tools are common lor ail 
sl2e systems. This offers the potential of development software (running on smaller (workstation machines) 
that is interchangeable among ell levels (workstation, aerospace, and supercomputer). That advantage 

so means programmers can develop programs on workstations while a production program runs on a much 
larger machine. 

As a result of our well balanced design implementation we meet today's requirements imposed by 
technology, performance, cost, end perception, and enable growth of the system Into the future. Since our 
MPP approach starts at the chip level, our discussion starts at the chip technology description and 

as concludes with the supercomputer application descriptions. 

Physical, interconnection, and architectural issues will all be addressed in the machine directly. 
Functions not only be integrated into a ssngia chip design, but tl>e chip design win provide functions 
sufficiently powerful and flexible dial (ho chip will be effective at processing, routing, storage and three 
classes of I/O. The interconnection network will be a new version of the hypercube which provides minimum 

40 network diameters without the inpuPOutput pin and wireebitfty limitations normally associated wrih hyper- 
cubes. The trade-off between SlMD and MIMD are eliminated because the design allows processors to 
dynamically switch between mimd and SlMD mode. This eliminates many problems which will be 
encountered by application programmers of "hybrid* machines. In addfflon, the design will snow a subset of 
(he proceGGore lo be in SlMD or MIMD mode. 

<ks The Advanced Parallel Array Processor (APAP) is a fine-grained parallel processor, it consists of comroi 
and processing sections which are partracnabte such that configuration? suitable for supercompuling 
through personal computing applications can be saltefied. In moat configurations It would attach to a heel 
processor and support the off loading of segments of the i>oe?"a workload- Because the apap array 
processing elements are general purpose computers, the particular typs of workload off-loaded will vary 

so depending upon the capabilities of the host For example, our APAP can be a module (or an IBM 3090 
vector processor manframe. When aftaclwd to a mainframe wrth high performance vector floating point 
capability the task off-loaded might be sparse to dense matrix transformations. Alternatively, when attached 
to a PC personal computer the off-loaded task might be numerically Intensive 3 dimensional graphics 
processing, 

53 The above referenced parent USSN 07/61 1,5&4, filed November 14, 1&0 of Dieffonderier et al„ tilled 
••parallel Associative Processor System'* describes the idea of integrating computer memory and control 
logic wKhin a smgie chip and replicating the combination within the chip and building a processor system 
out of replications ot the single chip. This approach which is continued and expanded here leads to a 
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system which provfchs massively parallel processing capability at the cost of developing and manufacturing 
only a single chip type while enhancing performance capability by reducing the chip boundary crossings 
and find length. 

The above referenced parent US8N 07*11,594, filed November 13, 1$90 HtuBfrated utilization of i- 

9 dimensional I/O structures(essentlally a linear I/O) wtlh multiple SlMD PMEs attached to that structure within 
a chip. This embodiment elaborates these concepts to dimensions greater than 1. The description which 
follows will be In ©rone of 4-dlrnensional I/O structures with S SIMD-'MIMD PMEs per chip. However, that 
can be extended to greater dimensionality or more PMEs per dimension as we will describe with respect to 
FIGURES 3, 9, 10. 15 and 16. 

10 Our proceBBing element includes a full I'O system including both data transfer and program Interrupts. 
Our description of our preferred embodiment will bo primarily described in terms of the preferred 4- 
dimensional i-D structures with 8 SIMDMIMD PMEs per chip, which has special advantages now *m our 
view. However. lhai can be extended to greater dfnvenswnalUy or more PMEs per dimension as described 
in our parent application. In addition, for most applications we prefer and have made inventions m areas of 

is greater dimensions with hypercube interconnections, preferably with the modified hypercube we describe. 
However, In some applications a 2-dimenstonaJ mesh Interconnection of chips will be applicable to a task at 
hand. For instance, in certain mdHary computers a 2 dimensional mesh will be suitable and cost effective. 

This disclosure extends the concepts from the Irrterprocessor communication to the external 
Input/Ouiput facilities and describes the Interfaces and modules required tor control of the processing array. 

so in summary three types of vo t inter-processor, processors to/from external, and broadoastr'oontroi are 
described. Massively poralM processing systems require all these types of K> bandwidth demands to be 
balanced with processor computing capability. Within the array these requirements wiD be satisfied by 
replicating a 16 bit (reduced) Instruction set processor, augmented with very fast Interrupt state swapping 
capability. Thai processor is referred to as the PME illustrating the preferred embodiment ol our APAP- The 

as c*>aracterlstiC9 of the PME are completely unique when oomparsd wtfih the processing elsments on other 
massively parallel machines. It permits me processing, routing, storage and I/O to be completely distributed. 
This is not characteristic of any other design. 

In a hypercube each PME can address as its neighbor* any PME whose address differs In any single bit 
position. In a ring, any PME can address as its neighbor the two PMEs whose addresses differ t 1. The 

30 modified hypercube of our preferred embodiment utilized for the APAP combines these approaches by 
building hypercubes out of rings. The intersection of rings is denned to be a node. Each ncde of our 
preferred system has its PME, memory and |.'O f and other raaturea of the node, formed m a semiconductor 
silicon low level CMOS DRAM chip. Nodes are constructed from multiple PMEs on each chip. Each PME 
exists In only one ring of nodes. PMEs within the node are connected by additional rings such tltat 

35 communications can be routed between rings within the node. This leads to the addressing structure where 
any PME can sisp messages toward the objective by addressing a PME In Us own ring or an adjacent pme 
wfthin the node. In essence a PME can address a PME whose address differs by 1 in one In the 1n?d bit 
field o$ its ring (where d is the number of PMEs in the ring) or the PME with (ho same address but existing 
in an adjacent dimension. The PME effectively appears to exist In n sets of rings, while in actuality It exists 

40 only in one reel ring and one hidden ring totally contained wahin the chip. The cftmensionality for the 
modified hypercube is defined to be the value n from the previous sentence. 

We prefer to use a modified hypercube. This Is elaborated In the part of this application describing the 
technology. Finally. PMEs wrthin a ring am paired such that one moves data externally clockwise along a 
ring of nodes and the olhsr moves data externally counterclockwise along tie ring of nodes, thus dedicating 

n a pme to an external port. 

in our massively parallel machine, in our preferred embodiment the interconnection and broadcast of 
data and Instructions from one PME to another PME In the node and externally of the node to oilier nodes 
of e duster or PMEs of a massively parallel processing environment ate performed by a programmable 
router, allowing reconfiguration and virtual flexibility to the network operations. This im portent feature is fully 

so distributed and embedded In the PME and allows lor processor communication and data transfers among 
PMEs during operations of the sytfem in SMD and MIMO modes, as well as in the StMF>MIMD and 
SIMIMO modes of operation. 

Within the rings each interconnection leg is a potnl-te-polnt connection. Each PME has a pomi-to-pomt 
connection with the two neighboring PMEs in its ring and with two neighboring PMEs in two adjacent rings, 

5s Three of these point-to-point connections are internal to the node, while the fourth point-to-point connection 
is to an adjacent node. 

The massively paranel processing system uses the processing elements, with their local memory and 
interconnect topology to connect all processors to each other Embedded within the PME is our fully 
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distributed VO programmabte router. Our system also provktss an addition la the system which provides the 
ability to toad and unload aJ! the processing elements. With our zipper we provide a method Tor loading and 
unloading of tte array of pes and thus enable implementation of a fast I/O along an edge of the array's 
rings. To provide for external interface I/O any subset of the rings may be broken {un-zippsd across some 

s dimension^)) with the re su Hani broken paths connected to the external Interface. The co-pending applica- 
tion entitled 'APAP \tO ZIPPER", filed concurrently herewith, USSN , filed May 22.1992. 
describes our 'zipper* in additional detail. The 'zipper 1 can be applied to only the subset of links required to 
support the peek external I/O load, which in all configurations considered so far leads to its being applied 
only to one or two edges of the physical design. 

?o The final type of VO consists of data that must be broadcast to, or gathered from all PMEs, plus data 
which is too specialized to fit on the standard bused. Broadcast data includes commands, programs and 
data. Gathered data is primarily status and monitor functions vmiie diagnostic and test functions are the 
specialized elements. Each node, in addition to the included sat of PMEs. contains one Broadcast and 
Control Interface (BCI) section. 

rs Consider PMEs interconnected in a modified 4 dimensional hypercube network. If each ring contains 16 
PMEs, (hen the system will have 32,768 PMEs. The network diameter is 10 steps. Each PME contains In 
SfW the router and reconfiguration S/W to support a particular outgoing port Thus, software routing 
provides the capability to reconfigure in the event of a faulty processing element or node. Inherent In a 4d, 
25 Mhz network design wllh byte wide half duplex rings is the provision for 410 gigabytes per second peek 

so internal bandwidth. 

The 4 dimensional hypercube leads to a particularly advantageous package. Rghr of the PMEs 
(Including data flow, memory and I/O paths and controls) are encompassed in a single chip. Thus, a node 
will be a single chip including pair* of elements along the rings. The nodes are configured together in an 8 
X 8 array lo make up e cluster. The fufly populated machine is built up of on orray of 0 X 8 clusters to 

29 provide the maximum capacity of 32.768 PMEs. 

Each PME is a powerful microcomputer having significant memory and VO functions. There is mutiibyf© 
data flow wlihin a reduced instruction set (RISC) architecture Each PME has id bit internal data flow and 
eight levels of program Interrupts wfch the use of working and general registers to manage data Now. There 
is a circuit switched and store and forward mode for I/O transfer under PME software control. The SIMD 

30 mode or MIMD mode Is under PME software control. The PME can execute RISC Instructions from either 
the BCI in a SIMD mode, or from He own main memory in MIMD mode. Specific RISC instruction code 
points can be reinterpreted to perform unique functions m the SIMD mode. Each PME can implement an 
extended Instruction Set Architecture and provide routings which perform macro level Instructions such as 
extended precision fixed point arithmetic, floating point arithmetic, vector arithmetic, and the like. This 

as permits not only complex math to be handled but image processing activities for display of image data in 
multiple dimensions (2d and 3d images) and (or multimedia applications. The system can select groups of 
PMEs for & function. PMES assigned can allocate selected data and instructions for group processing. The 
operations can be externally monitored via the BCI. Each BO has a primary control input, a secondary 
control Input end a status monitor output for the node. Within a node the 2n PMEs can be connection for a 

40 binary hypercube ccmmunicaaon network within the chip. Communication between PMEs is controlled by 
the bits in PME control registers under control of PMG software. This permits the system to have a virtual 
routing capability. Each PME can step messages up or down Its own right or to Its neighboring PME In 
aimer of two adjacent rings. Each interface between PMEs Is a point-to-point connection. The I/O ports 
permit off-chip extensions of the internal ring to adjacent nodes of the system. The system is bu3t up of 

*$ replications of a node to form a node aray, a cluster, and other configurations. 

To complement our system's SIMD, MiMO, SlMD/MIMD and SIMIMD functionality, our development we 
have provided unique operational modes. Among our SIMD/MIMD PME*s unique modes are Ihe new 
functional features referred to ae the "store and forward / circuit switch" functions. These hardware functions 
complemented with the on chip communication and programmable interna) and external I/O routing provides 

so the PME wKh very optimal data transferring capability. In preferred mode of operation the processor 
memory is gsneraHy the data sink for messages and data targeted at the PME in the store and forward 
mode. Messages and data not targeted for the PME are sent directly to the required output port when In 
circuit switched mode. The PME software performs the selected routing path while giving the PME a 
dynamically selectable store and forward / circuit switch functionality, 

ss Among the advances we have provided is a fully distributed architecture for PMEs of a node. Each 
node has 2n processors, memory and I/O. Ever/ PME will provide very flexible processing capability with 
16 bit data flow, 64K byte* of local storage, store and forwartaircuit switch logic, PME to PME 
communication. SIMD/MMD switching capabilities, programmable routing, and dedicated floating point 
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assist logic. The organisation of every PME and its communication paths with otter PMEs wfthtn the sams 
chip to minimize chip crossing delays. PME functions can be independently operated by the PME and 
integrated vyHh function* hi the node, a cluster, ana larger arrays. 

Our massively parallel system is made up of nodal building blocks of mua'processor nodes, clusters of 

9 nodes, and arrays of PMEs already packaged In clusters. For control of these packaged systems whe 
provide a system array director which with the hardware controllers performs the oveiell Prooeseif>g 
Memory Element iPMEj Array Controller functions In the massively parallel processing environment The 
Director comprises of tf>ree functional areas* the Application Interface, the Cluster Synchronizer, and 
normally a Cluster Controller. The Array Director win have the overall control of the PME array, using ftie 

10 broadcast bua and our zipper connection to steer data and commands to all of the PMEs. The Array 
Director functions as a software system interacting with the hardware to perform the role as the shell of the 
APAP operating system. 

The interconnection for our PMEs Tor a massively parallel array computer SIMD/MIMD processing 
memory element {PME> Interconnection provides the processor to processor connection In the massively 

/s parallel processing environment Each PME uafizes our fully distributed rrtterproceesor communication 
hardware from Hie on-chip PME to PME connection* to the off-chip I/O facilities which support the chip-to- 
chip interconnection. Our modified topology limits our cluster to cluster wiring while supporting the 
advantages of hypercube connections. 

The concepts which we employ for a PME node are related to the VLSI packaging techniques used for 

so the Advanced Parallel Array Processor (APAP) computer system disclosed here, which packaging features 
of our invention provide enhancements to the manufacturing ability of me APAP system. These techniques 
ere unique in the area of massively parallel processor machines and will enable wo machine to he 
packaged and configured In optimal subsets that can be built and testsd. 

The packaging techniques lake advantage of the eight PMEs packaged in a single chip and arranged in 

& a N-qlmensionai modified hypercube configuration. Tnis chip level package or node of the array le the 
smallest building block the APAP design. These nodes are then packaged in an 8 x 8 array where the +- 
X and the makes rings wiWn ihe array or cluster and the *-W> and +-Z are brought out to the 
neighboring clusters. A grouping of clusters make up an array. The intended applications for APAP 
computers depend upon the particular conjuration and host Large systems attached to mairtframes with 

so effective vectorized floating point processors might address special vectorizable problems - such as weather 
predtc&on, wind funnel simulation, turbulent fluid modeling and finKe element modeling. Where these 
problems invoke sparse matrices, significant work must be done to prepare the data for vectorized 
arithmetic and likewise to store results. That workload would be off loaded to Ihe APAP, In intermediate Bfcze 
systems, the APAP might be dedicated to perron* ing the graphics operations associated wtfh visualization, 

38 or with some preprocessing operation on incoming data (j.e., performing optimum assignment problems in 
military sensor fusion appHcattons). Small systems attached to workstations or PCs might serve as 
programmer development stations or might emulate a vectorized floating point processor attachment or a 
3d graphics processor. 



O BRIEF DESCRIPTION OF THE DRAWINGS 



FIG. 1 shows a parallel processor processing element tike those which would U81l2e old technology. 

FIG. 2 stows a massively parallel processor building block in accordance wrih our invention, 
representing our new chip design. 
<a FIG. 3 illustrates on the right side the preferred chip physical cluster layout for our preferred 
embodiment of a chip single node fine grained parallel processor. There each chip is a 
scalable parallel processor chip providing 5 MIPe performance with CMOS ORAM memory 
and logic permitting air cooled implementation of massive concurrent systems. On the left 
side of Figure 3, there ie illustrated the replaced technology, 
so FIG. 4 shows a computer processor functional block diagram In accordance with the Invention. 

FIG. 5 shows a typical Advanced Parallel Array Processor computer system conhguretion. 

FIG. 6 shows a system overview of our fine-grained parallel processor technology In accordance 
with cur invention, Illustrating system build up using replication of the PME element which 
permits systems to b« developed with 40 to 193,840 mips performance, 

55 FIG. 7 illustrates ihe hardware tor the processing element (PME) data flow and local memory in 
accordance with our invention, while 
FIG. S illustrates PME data How where a processor memory element Is configured as a hardwired 
general purpose computer that provides about 5 MIPS fixed point processing or A Mflops via 
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programmed control floating point operations. 
FIG. 9 shows the PME to PME cormeeUon (binary hyperoibe) and data paths that can be taken In 

accordance with our invention, white 
FIG, 10 illustrates node irrterconnecttons tor the chip or node which has 8 PMEs, each of which 
s manages a single external port end permits distribution of the network control function and 

eliminates a functional hardware port bottleneck. 
FIG, II is a block diagram of a scalable parallel processor chip where each PME is a 16 bit wide 

processor vrtth 32 K woids of local memory and there Is I/O porting for a broadcast port 

which provides a ccrtroller-to-aJI Interface while external porta are W-dlrectional point-to-point 
70 interfaces permitting ring torus connections within the chip and externally. 

Fi-3. 12 shows an array director in the preferred embodiment 

FIG, 13 in part (a) Illustrates the system bus 10 or from a cluster array coupling enabling loading or 
unloading of the array by connecting the edges of dusters to the system bus (see FIGURE 
14). In FIGURE 13 In pert {bl there is the bus to/from the processing element porter 
is FIGURE 13 illustrates how multiple system buses can be supported with multiple clusters . 

Each cluster can support SO to 57 Mbyte's bandwidth. 

FIG. 14 shows a w zippei " connection for fast \fO cora«ctton. 

FIG, 15 shows an 0 degree hypercube connection illustrating a packaging technique in accordance 
with our invention applicable to an 8 degree hypercube. 

20 FIG. 19 shows two independent node connections in the hypercube. 

FIG. 17 shows the Grtonic Sort algorithm as an example to illustrate ihe advantages of the defined 

Simd/mimd processor system 
FIG, id ilustrates a system bloc* diao/am for a host attached large system wftn one application 
processor inter fees illustrated. This illustration may also be viewed with the understanding 

25 that our Invention may be employed in stand alone systems which use multiple application 

processor interfaces. Such interfaces in a FIGURE 18 configuration will support 
0A80&raphic8 on all or many dusters, Workstation accelerators can eliminate the host, 
application processor interface (API) and cluster synchronizer (CS) Illustrated by emulation. 
The cs is not required in ail instances. 

aj FIG. 19 IBuslrates the software development environment lor our system. Programs can be prepared 
by and executed feom ihe host application processor. Boih program and machine debug is 
supported by the workstation based console illustrated here and in FIGURE 22- Both of these 
services will support applications operating on a real or a simulated MMP. enabling applica- 
tions to be developed at a workstation level as well as on e supercomputer formed of the 

33 APAP MMP. The common software environment enhances programmability and distributed 

usage. 

FIG. 20 illustrates the programming levels which are permfitad by the new systems. As different 
users require more or less detailed knowledge, the software system 1 3 developed to suppori 
this variation. At the highest level the user does not need to know the architecture Is Indeed 
an MMP, The system can be used with existino, language systems for partitioning oi 

programs, such as parallel Fortran. 

FIG. 21 llustrates the parallel Fortran compiler system tor the MMP provided by the APAP conigura- 
Dons described. A sequential to parallel compiler system uses a combination of existing 
compiler capability with new data allocation functions and enables use of a partitioning 
43 program like FortranD, 

FIG. 22 illustrates the workstation application of the APAP, where the APAP becomes a workstation 
accelerator. Note that the unit has the earns physical size as a RISC/3000 Model 530< bui 
this model now contains an mmp Yvtuch is euached to the workstation via a bus extension 
module illustrated, 

so FIG. 23 llustrates an application for an APAP MMP module for an AvVACS military or commercial 
application. This is a way of handling efficiently the classical distributed sensor fusion 
problem shown In FIGURE 23, where the observation to track matching Is classically done 
with well know algorithms like nearest neighbor* 2 dimensional linear assignment (Munkes..). 
probabilistic data association or multiple hypothesis testing, but these can now be done in an 

63 improved manner as illustrated by FIGURES 24 and 25, 

FIG. 24 Iiustrate3 how the system provides the ability to handle n-dlmenslonal assortment problems 
In real time, 

FIG. 25 ilustrates processing flow for an rKfirnensional aesignment problem utilizing an APAP. 
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FIG. 26 illustrates the expansion unit provided by the system encbsure described shewing how a 
unU can provide 42<1 MflopS or 5120 MIPS using only 8 to 10 extended SEM-E modules, 
pi-oviding the performance comparable to that of specialized signal processor module in only 
.6 cubic feet This system can become a SIMD massive machine with 1024 parallel 
3 processors performing two billion operations per second (COPS) and can grow by adding 

1024 addraonal processors end 32MB additional storage. 
FIO. 27 illustrates me APAP packaging for a supercomputer. Here is a large system of comparable 
performance but much smeller footprint than other systems. K can be built by replicating the 
APAP cluster wimin an enclosure Oka those used for smaller machines, 
10 We have provided, as pari of ihe description, Tables Itust rating ihe hardwired instructions for a P ME, in 
which Table 1 illustrates RxeoVpoirn arithmetic instructions: Table 2 illustrates storage to storage Instruc- 
tions; Table 3 itiustraies logical instructions; Table 4 illustrates shft instructions; Tsbte 5 illustrates branch 
instructions; Table 6 illustrates Ihe status switching instructions; and Table 7 illustrates the inputfoutput 
instructions. 

;s (Note: For convenience of HJustraoon in the formal patent drawings, FIGURES may be separated In parts 
and as a convention we place the top of the FIGURE as the first sheet with subsequent sheets proceeding 
down and across when viewing the FIGURE, in the event that multiple sheets are used-) 

Our detailed description follows with parte explaining the preferred embodiments of our invention 
provided by way of example. 

DETAILED DESCRIPTION OF THE INVENTION 

Turning now to our invention in greater detail, it will be seen from FIGURE I, which illustrates me 
existing technology level, illustrated by the transputer TOGO chip, and representing similar chips for such 

& machines as tt>e illustrated by the Touchstone Delta (1690). N Cube f38$). and others. When FIGURE 1 is 
compared with ihe developments here, it will be seen that not only can systems like the prior systems be 
substantially Improved by employing our invention, but also new powerful systems can be created, as we 
wBI describe. RGURE i's conventional modem microprocessor technology consumes pins and memory. 
Sandwitteh is limited end inter-chip communication dregs the system down. 

30 The new technology leapfrog represented by FIGURE 2 merges processors, memory, VO into multiple 
PMEs (eight c» mo?e 16 bit processors each of which has no memory access delays ana uses all the pins 
for networking) formed on a single low power CMOS DRAM chip. The system can make use of ideas of our 
prior referenced disclosures as wen as Invention separately described In the applications filed concurrently 
herewith and applicable to the system we describe here. Thus, for this purpose they are hKOrporated herein 

3S by reference. Our concepts of grouping, autonomy, transparency, zipper interaction, asynchronous 8IMD, 
Simimd or $imd/mimd. can all be employed with the new technology, even though to lesser advantage 
they can be employed in the systems of the prior technology and in combination with our own prior muftipis 
picket processor. 

Our picket system can employ the present processor. Our basic concept is that we have now provided 

49 a repticable brick, a new basic building block for systems with our new memory processor, a memory unit 
having embedded processors, router and I'O, This bask building block is scalable. The basic system which 
we have implemented employs a 4 Meg. CMOS DRAM. It is expandable to be used in larger memory 
configurations, with i6Mblt DRAMS, and 64Mbit chips by expansion. Each processor is a gate array. With 
denser deposition, many more processors* at higher clock speeds, can be placed on the same chip, and 

4$ using gates and additional memory wlH expand the performance of each pme. Scaling a single parr type 
provides a system framwork and architecture which can have a performance weH into the PETAOP range. 

FIGURE 2 Illustrates the memory processor which we call the PME or processor memory element in 
accordance with our preferred embodiment The processor has eight or more processors. In the pictured 
embodiment there are eight The chip can be expanded (horizontally) to add more processors* The chip 

so can, as preferred, retain ihe logic and expand the DRAM memory with additional ceils linearly (vertically). 
Pictured are 16 - 32k by 9 bit sections of DRAM memory surrounding a field of CMOS gate array gates 
which Implement S replications of a 10 bit wide data flow processors. 

Using IBM CMOS low power sub-micron BM CMOS deposition on silicon technology, it uses selected 
silicon with trench to provide significant storage on a small chip surface. Our memory and multiple 

5« processors organized interconnect is made with IBM's advanced art of making semiconductor chips. 
However, it will be recognized that the Bttie chip we <fe9crlbe has about 4 Meg. memory. H is designed so 
that as 16 Meg. memory technology becomes stable, when improved yields and methods or accommodat- 
ing defects are certain, our little chip can migrate to larger memory sizes each 9 bits wide without changing 
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the logic. Advances m photo and X-ray renography keep pushing minimum fsaturs size to well below .5 
microns. Our design envision 5 more progress. Those advances will permit placement of very large amounts 
of memory with processinG on a single silicon chip. 

Our device is a 4 MEG CMOS DflAM believed to be tho first general memory chip with extensive room 

s lor logic. 18 replications of a 32k by 9-bli DRAM macro make up the memory array. The DRAM has 120K 
c&rjs it allocates with significant surface area for application logic on me chip, with tripia level metal wiring. 
The processor logic cells are preferably gate array cells. The 35 ns or less DRAM access time marches the 
processor cycle time. This CMOS Implementation provides logic density for a very effecflve PE (picket) and 
does 90 while dissipating 1.3 wstrs for the logic. The separate memory section of the chip, each 32K by 9 

?o bits, (with expansion nol changing logic) surrounds Ihe field of CMOS gate array gates representing 120K 
cells, and having the logic described in other figures. Memory Is barriered and with a separated power 
source dissipates ,9 waits. In providing the combining of significant amounts of logic on the seme silicon 
substrate win significant amounts of memory problems involved wiih the electrical noise incompatibility oi 
logic and DRAM have been overcome. Logic tends to be very noisy vrfiile memory needs relative quiet to 

*5 sense the millivolt size signats that result from reading the cells of DRAM. We prefer to provide trenched 
triple metal layer silicon deposition, wrtfi separate barriered portions of the memory chip devoted to memory 
and to processor logic with voltage and ground isolation, and separate power distribution and barriers, to 
achieve compatibility between logic and DRAM. 

ao apap System Overview of Prefe»efl Embodiments 

This description introduces the new technology In the following order 

1, Technology 

2. Chip H/W description 

29 3. Networking and system build up 

4. Software 

5. Applications 

The initial sections of the detailed description describe how 4-Meg dram low power CMOS chips are made 
to include & processors on and as port of me manufactured PME DRAM chips each supporting: 
so f. 16 bit &MIP dataflows, 

2. independent instruction stream and interrupt processing and 

3, S bit (plus parity and controls) wide external port and interconnection to 3 ether on chip processors. 
Our invention provides multiple functions which are Integrated into a single chip design. The chip will 

provide PME functions which a*e powerful and flexible and sufficiency so such that a chip having scalability 
as will be elf active at processing, routing, storage and three classes of I/O. This chip has integrated memory 

and control logic within trie single chip to msfce the pme, and this combination Is replicated within the chip. 

A processor system is built from replications or the single chip. 

The approach partitions the low power CMOS DRAM, ll wit) be formed as mutlipie word length <18) bit 

by 32K sections, associating one section with a processor. (We use the term PME to refer to a single 
#t processor, memory and I/O capeble system unit.) This parffiioning lecds to each DRAM chip being an 8 

way 'cube connected' MfMD parallel processor with 8 byte wide independent interconnection ports. (See 

FIGURE $ for an illustration of a replication of fine-grained parallel technology, illustrating replication and the 

ring torus po&siDflriies.) 

The software description addresses several distinct program types. At ihe lowest level, processes 
<*s interface the user's program (or services caned by the application) to the detailed hardware H/w needs. 
This level includes the tasks required to manage the k'O and interproceseor synchronization and is what 
might be called a microprogram for lite MPP. An intermediate level of services provWe for bom mapping 
applications {developed with vector or matrix operations) to the MPP, and also control, synchronization, 
startup, diagnostic functions. At the host level, high order languages are supported by library functions that 
so support sectorized programs with either simple automatic data allocation io the MPP or user tuned data 
allocation. The multMevel software S?W approach permits applications to exptoft different degrees of control 
and optimization within a single program. Thus, a user can code applcstlon programs without understanding 
the architecture detail while an optimizer might tune at the microcode level only the small high usage 
kernels of a program, 

55 Sections of our description that describe 1024 element 5 GIPS units and a 35,766 element 164- GlP8 
unit illustrate the range of possible systems. However, those are not the limits; both smaller and larger units 
are feasible* These particular sizes have been selected as examples because the small unit is suitable to 
microprocessors (accelerators), personal computers, workstation and military applications (using of coarse 
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different packaging techniques), while the larger unit is iltustarave of 8 mainframe application as a module or 
complete supercomputer system. A software description will provide examples of other challenging work 
that might be effectively programmed on each of me illustrative systems. 

s PME PRAM CMOS - A BASE FQR A MULTIPROCESSOR PME 

FIGURE 2 illustrates our technology improvement at che chip technology level. This extendable 
computet organization Is very cost end performance efficient over the wide range of system sizes because 
it uses only one chip type. Combining the memory and processing on one chip eliminates the pine 

70 dedicated to the memory bus and (heir associated reliability and performance penalties. Replication of our 
design withm the chip makes it economically feasible to consider custom logic designs for processor 
subsections. Replication of the chip within the system leads to large scale manufacturing economies. 
Finally. CMOS technology requires low power per Ml P. which in turn minimizes power supply and cooling 
needs. The chip architecture can be p?ogremmed for multiple word lengths enabling operations to be 

;s per rormed that would otherwise require much larger length processors. In combinoaon these attributes 
permit the extensive range of system performance. 

Our new technology can be compared with a possible extension of the old teclwiolOQy ft overlaps, it is 
apparent thai the advantages of smaller features have been used by processor designers *> construct more 
complex chips and by memory designers to provide greater replication of the simple element. If the trend 

20 continues one could expect memories to get 'our times as large while processors might exploit density to: 

1 . include multiple execute units with instruction routers, 

2. increase cache sizes and associative capability and/or 

3. increase instruction look ahead and advance computation capaoittty. 

However, these epproeches to ihe old technology illustrated by FIGURE 1 all tend to dead end. 

2S Duplicating processors leads to linearly Increasing pin requirements bus pins per chip is fixed. Better cache* 
ing can only exploit the application's data reuse pattern. Beyond that, memory bandwidth becomes the limit. 
Application data dependencies and branching limit rim potential advantage of look ahead schemes. 
Additionally, It Is not apparent that MPP applications with ftne-grelned parallelism need 1, 4. or 16 
Megaword memories per processing unit Attempting to share such large memories between multiple 

30 processors results In 9evere memory bandwidth limitations. 

Our i>ew approach is not dead ended. We combine both significant memory and I/O and processor into 
a single chip, as illustrated by the FIGURE 2 and subsequent illustration and description. It reduces pert 
number requirements and eliminates the delays associated with chip crossing, More Importantly, this 
permits all the chip T s I/O pins to be dedicated to interprocessor communication and thus* maximizes 

35 network bandwidth. 

To implement our preferred embodiment Illustrated in FIGURE 2 we use a process thai is available now, 
using IBM low power CMOS technology. Our illustrated embodiment can be made with CMOS DRAM 
density, m CMOS and can bo implemented in denser CMOS. Our illustrated embodiment of 32K memory 
cells tor each of 8 PHEs on a chip can be increased a9 CMOS becomes denser. In our embodiment we 

40 utifee the reel estate end process technology for a 4 MEG CMOS DRAM, end expand mis with processor 
replication associated wtih 32K memory on Jhe chip itself. The chip, it wifl be seen has processor, memory, 
and I/O in each of the chip packages of the cluster shown in FIGURE a Within each package Is a memory 
with embedded processor element, router, and I/O. all coroalnee In a 4 MEG CMOS DRAM believed to be 
the first general memory chip with extensive room for logic. U uses selected silicon wtft trench to provide 

<$ significant storage on a small chip surface. Each processor chip of our design alternatively can be made 
with 18 replications of a 32K by 9 bft ORAM macro (35/80 ns) using .07 micron CMOS logic to make up the 
memory array. The device Is unique in that H allocates surface area for 120 K cells of application logic on 
the chip, supported by the capability of triple level metal wiring. The multiple cards of the old technology ie 
shown crossed out on the leit side of FIGURE 3. 

so Our basic repltcable element brick technology is an answer to the old technology. If one considered the 

"Xed" technology on the left of FIGURE 3w one would see too many cwps, too many cards, and waste. For 
example, Today's proposed teraflop machines thai others offer would have literally a million or more chips in 
them. With todays other technology only a few percent of these chips, at best are vuly operations 
producers. The rest are "overhead 1 ' (typically memory, network interface, etc.). 
58 It will become evident that it is not feasible to package ouch chips, in such a large number, in anything 
that must operate in a constrained eiwlronment of physical stee. (How many could you nt in a srnai area of 
a cockpit?) Furthermore, such proposed tersrHop machines of others, already huge, must scale i*> 100Gx 
times to reach the petaop range. We have a solution which dramatically decreases the percent of non- 
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operadons producing chjps. We provide Increased bandwidth. We provide this within a reasonable network 
Dimensionality. With such a brick technology* where memory becomes the operator, and networks are need 
for passing control*, where operation* producing chips are dramatically increased, In addition, the upgrade 
dramatically reduces the number of different types of chips. Our system is designed for soate-up, without a 

s requirement for spedalbed packaging, cooling, power, or environmental constraints. 

wnh our brick technology. uSHzino insisad of separate processors, memory units wrth bum in 
processors and network capability, ihe configuration shown in FIGURE 3, representing a card, with chips 
which are pin computable with current 4Mblt DRAM card* at the connector level. Such a single card could 
hold, with a design point ot a basic 40 mip per chip performance level, 32 chips, or 1280 mips. Four such 

w cards would provide 5 gips. The workstation configuration which is illustrated woupd preferably have such a 
PE memory array, a cluster controller* and an IBM RISC System/6000 which has sufficient performance to 
run and monitor execution of an array processor application developed at the workstation. 

A very gate efficient processor can be used in the processor portion. Such designs for processors have 
been employed, but never within memory, todeed* m addition, we have provided the ability to mix mimd 

>5 and SIMD basic operation provisions. Our chip provktes a "broadcast bus" which provides an alternate peth 
Into each CPU's mstruction buffer. Our cluster controller issues commands to each of the PEs in the PMEs, 
and these can be stored in the PME to control their operation in one mode or another. Each PME does not 
have to store an entire program, but can store only those portions applicable to a given task at various 
limes during processing of an application. 

20 Given the basic device one can elect to develop a single processor memory combination. Alternatively, 
by using a more simple processor and a subset of ihe memory macros one can design for either 2. 4-, 0 or 
16 replications of (ho basic processing clement (PME>> The PME can be made simpler either by adjusting 
the dataflow bandwidth or by substftuf ng processor cycles for functional accelerators. For most embodi- 
ments we prefer lo make 8 replications of the bask: processing element we describe. 

25 Cvr application studies neve indicated that tor now the moot favorable answer is 8 repBcations of a 10 
bit wide data flow and 32K word memory. We conclude this because: 
1 . 16 bit words permit single cycte fetch of instructions and addresses. 

2. 8 PMEs each with an external port permits 4 dimensional torus Interconnections. Using 4 or 8 PMEs 
on each ring leads to modules suitable lor the range of targeted system performances, 
3D 3. 8 external ports requires about 50% of the chip pins, providing sufficient remainder tor power, ground 
and common control signals. 

4. S Processors implemented in a 64 KByte Main Store 

a. allows for a register based architecture rather than a memory mapped architecture, and it 
b* forces some desirable but not required accelerators to be implemented by mufiipJe processor 
as cydes. 

This last attribute is important because it permits use of the developing logic density increase. Our new 
accelerators (ex. floating point arithmetic unit per PME) are added as cmp hardware without affecting 
system design, pins and cables or application code. 

The resultant chip layout and stee (14.59 x 14.03 mm) is shown In FIGURE 2, and FIGURE 3 shows a 
duster oi such chips, which can be packaged in systems like those shown in later FIGURES for stand alone 
unite, workstations which slide next to a workstation host with a connection bus, in AWACs applications, and 
In supercomputers. This chip technology provides a number of system level advantages- It permits 
development of the scalable MPP by basic repltesiion of a single part type. The two DRAM macros per 
processor provide sufficient storage for both data and program. An SRAM of equivalent size might consume 

45 more than 10 times more power. This advantage permits mimd machine models rather than the more 
limited SIMD models characteristic of machines with single chip processor/memory designs. The 35 ns or 
lass DRAM access time matches (he expected processor cycle time. CMOS logta provides the logic density 
for a very effective PME and does so whae dissipating only 1.3 watts. (Total chip power is 1.3 + .9 
(memory; = 22 w,) Thoss features in turn permit using the chip in MIL applications requiring conduction 

so cooling. (Air cooing in non-MIL applications Is significantly easier.) However, the air cooled embodiment can 
be used for workstation ejrd other environments, A standalone processor might be configured wnh an 60 
amp - 5 volt power supply. 

Advanced Parallel Array Processor {APAP) bunding Weeks are shown in FIGURE 4 and In FIGURE 3. 
FIGURE 4 illustrate* m functional block diagram of tne Advanced Parallel Array Processor. Multiple 

69 application interfaces 150, 180, 170, 180 wrist for the application processor 100 or processors 1 10, 120, 
130. FIGURE 5 Illustrates the basic building Weeks that can be configured Into different system block 
diagrams. The APAP. in a maximum configuration, can incorporate 32,768 idanticfiJ PMEs. The processor 
consists of the PME Purcay 280, 290. 300, 3 10. an Array Director 250 and an Application Processor Interface 
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260 for the application processor 200 or processors 21 0, 220. 230. The Array Director 250 consists of three 
functional untie: Application Prooeoocr Interface 260. cluster Synchronizer 27D and cluBior CorcroJIer 270. 
An Array Director car* perform the functions of the array controller of our prior linear picket system for SIMD 
operations with MIMD capability, Tha cluster controller 270, along with a sot of 64 Array clusters 200, 290, 

5 300, 310, (le. cluster of 512 PMEsJ. is die basic building block of the APAP computer system. The 
elements of tha Array Director 250 pewit? configuring systems writ* 6 wide range of cluster replications. This 
modularity based upon strict replication of berth processing and control elements is unique to this massively 
parallel computer system. In addition, the Application Processor interface 260 supports the TesVDebug 
device 240 which v*ili accomplish important design, debug, and monitoring functions, 

10 Controllers are assembled with a watt-defined interface, e.g. IE Ms MicroChannel used in other systems 
today* including controllers wfth i860 oroce$sors. Reld programmable gate array? add functions to the 
controller which can be changed to meet a particular configuration's requirements (how many PMEs there 
are. their couplings, etc.) 

The PME arrays 260, 280, 300, 310 contain the functions needed to operate as either SIMD or MIMD 
;s devices. They also contain functions that permit the complete set of PMEs to be divided into 1 to 256 
distinct subsets, When divided into subsets the Array Director 250 Interleaves between subsets. The 
sequence of the interleave process and me amount of control exercised over each subset is program 
controlled. This capability to operate distinct subsets of the array in one mode, i.e., MIMD with differing 
programs, while other sets operate m tightly synchronized SIMD mode under Array Director control, 
20 represents an advance in the art. Several examples presented later illustrate the advantages of tl* concept 

Array Architecture 

The set of nodes forming the Array is connected as a n-dirnensional modified hyper cube. In that 
20 Interconnection scheme, each node has direct connections to 2n other nodes. Those connections can be 
either simplex, half-duplex or fun-duplex type paths. In any drfienston greaser than 3d. the modified 
hypercube is a new concept in interconnection techniques (The modified hypercube in the 2d case 
generates e torus, and in the 3d case an crthogonafly connected lattice with edge surfaces wrapped to 
oppose surface.) 

30 To describe tha interconnection scheme for greater than 3d cases requires an inductive description. A 
set of mi nodes can be interconnected as a ring. (The ring could be Msimpry connected 1 , 'braided*, 'cross 
connected', 'tally connected*, eta. Although additions! node ports are needed for greater than simple rings, 
that added complexity does not affeci the modified hypercube structure.) Tlte ma rings can (hen be linked 
together by connecting each equivalent node tn the m 2 set of rings. The result at this point is a torus. To 

?s construct a i+ td modified hypercube from an id modified hypercube, m i4 > sets of id modified hypercubes 
and interconnect all of the equivalent mi level nodes into rings. 

This process is illustrated for the 4d modified hypercube. using mi » 8 for i - i,.4 by the illustration in 
FIGURE 6\ Compare our description under node Topology and also FIGURES 8, 0, 10, 15 and 18. 

FIGURE 9 illustrates the fine-grained parallel technology path from the single processor element 300, 

40 mada up of 32K ie-bft words with a 16-Wt processor to the Network nods 3f 0 of eight processors 312 and 
their associated memory 31 1 with their fully distributed I/O routers 31 3 and Signal I/O ports 314, 315, on 
through groups of nodes labeled clusters 320 and Into the cluster configuration 360 and to the various 
applications 330, 340, 350, 370. The 2d level structure is the cluster 320, and 84 clusters are Integytfed to 
form the 4d modified hypercube of 32.788 Processing Elements 360. 

Processing Array Etement (PME) Prefened Embodimerit 

As illustrated by FIGURE 2 and FIGURE 11 the preferred APAP has a basic building block of a one 
chip node. Each node contains S identical processor memory elements (PMEs) and one broadcast and 

so control Interface (BO). While some of our inventions may be implemented when all functions are not on the 
same chip, ft is important from a performance and cost reduction standpoint to provide the chip as a one 
chip node whh die 6 processor memory elements using the advanced technology which we have described 
and can be implemented today. 

The preferred implementation of a PME has a 64 KByle main store. 16 16-bit general registers on each 

55 Of 8 program Interrupt levels, a full function arithmetic/logic unit (ALU) with working registers, a status 
register and four programmable bKftrecttoneJ I/O ports, in addition mo preferred implementation provides a 
SIMD mode broadcast interface via the broadcast and control interface <BCI) which allows an external 
controller <eee our original parent application and the description ol our currently preferred embodiment for 
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a nodal array and system wrih dusters) to drive PME operation decode, memory address, and ALU data 
Inputs. This chip can perform (he functions of a microcomputer allowing multiple parallel operations to be 
performed within i% and it can be coupled to other chips within a system of multiple nodes, whether by an 
interconnection network, a mesh or hyporcub© network, or our preferred and advanced scalable embodi- 
5 men! 

The PMEs are interconnected In a series of rings or tori in our preferred scalable embodiment. In some 
applications the nodes could be Interconnected in a mesh. In our preferred embodiment each node contains 
two PMEs In each of four tori. The tori are denoted and Z (see FIGURE 8). FIGURE 11 depicts the 
Interconnection of PMEs within a node. The two PMEs In each torus are designated by their external iO 
jo port f+W, -W, + x\ -X, + Y. -Y, + 2, -2). Within the node, there are aSso iwo rings which interconnect the 4 
+ n and 4 hi PMEs. These internal rings provide the path lor messages to move between the external tori- 

Since the APAP can be in our preferred embodiment a four dimensional orthogonal array, the internal 
rings allow messages to move throughout the array in all dimensions. 

The PMEs are self-contained stored program microcomputers comprising a main store, local store, 
j 5 operaaon decode, arfthmettc/logic unrt (ALU), working registers and InpuVOinput I/O ports. The PMEs have 
the capability of (etching and executing stored Instructions from iheir own main store in MIMD operation or 
to fetch and execute commands via the Bd interface In SIMD mode. TWs interface permits intercommuni- 
cation among the controller, the PME, and other PMEs m a system mads up of rnuhlpl© chipe. 

The BCl Is the node's interface to the external array controller element and to an array director. The 
20 BCl provides common node functions such as timers end docks. Tbe BCl provides broadcast function 
masking for each nodal PME and provides rhe physical interface and buffering for the broadcast-busno- 
PME data iranslers, and also provides the nodal Interface to system status and monitoring and debug 
elements. 

Each PME contains separate interrupt levels to support each of its poinHo-point interfaces and the 

29 broadcast Interlace* Data is Input to the PME main store or output from PME main store under Direct 
Memory Access (DMA) control. An "input transfer complete* interrupt is available for each of the interlaces 
to signal the PME software that data is present Status information is available for the software to determine 
the completion of data output operations. 

Eacn PME has a "circuit switched mode" of I/O rn which one off its four input poris can be switched 

30 tSrectfy to ones of Its four oulput ports, without having the data enter Ihe PME main store. Selection of the 
source and destination of the "circuit swiTCh" is under control of the software executing on the PME. The 
other three input porta continue to have access to PME main store functions, white the fourth input Is 
switched to an output port. 

An additional type of I'O has data that must be broadcast to, or gathered from all PMEs» plus data 

35 which is too specialized to tit on the standard buses. Broadcast data can include SIMD commands, MIMD 
programs, and $)MD data. Gathered data is primarily status and monitor functions. Diagnostic and test 
functions are ihe specialized data elements. Bach node, in addition to the included set of PMEs, contains 
one 601. During operations the BCl section monitors the broadcast interface and steens/ccllectB broadcast 
data to/from the addressed PME(s). A combination of enabling masks and addressing tags are used by the 

<*> BCl to determine what broadcast information is intended for which PMEs. 

Each PME is capable of operating in 3(M0 or in MIMD mode in our preferred embodiment. In SIMD 
mode, each instruction is ted Into the PME from the broadcast bus via the BCl. The BCl buffers each 
broadcast data word untfi aU of its selected nodal PMEs have used it This synchronizaticn provides 
accomrnodaScn of the data timing dependencies associated with the execution of SIMD commands and 

<« allows asynchronous operations to be performed rjy a PME. in mimd mode, each PME executes its own 
program from its own main store. The PMEs are initialized to the SIMD mode- For MIMD operations, th« 
external controller normally broadcasts Ihe program to each of the PMEs while Ihey are in SIMD mode, and 
then commando the PMEs to switch to MIMD mode and begin executing. Masking/tagging the broadcast 
information allows different sets of PMEs to contain different MIMD programs, and/or selected sets of PMEs 

so to operate In MIMD mode while other sets of PMEs execute In SIMD mode. In various software clusters or 
partitions these separate functions can operate independently of the actions in other dusters or partitions. 
The operation of ale Instruction Set Archrrecruro (ISA) of the PME will vary slightly depending on whether 
the PME is In the SIMD or MIMD mode. Most ISA instructions operate Identically regardless of mode. 
However, since the PME In SIMD mode does not perform branching or other control functions some code 

as points dedicated to those MIMD instructions are reinterpreted in SiMD mode to «How the PME to perform 
special operation* such as searching main memory for a match to a broadcast data value or switching to 
MiMD mode. This further extend* system flexibility of an array. 
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PME Architecture 

Basically, our preferred architecture comprises d PME wrach has a 10 bit wide data flow, 32K of 16 bit 
memory, specialized I/O parte and L'O switching paths, plus the necessary control logic to permii each PME 

s to fetch, decode and execute the 13 bit Instruction set provided by our instructor! set architecture (ISA). 
The preferred PME performs the funcfeons of a virtual router, and thus performs both the processing 
functions and data routsr functions. The memory organization allows by cross addressing of memory 
between PMEs access to a large random access memory, end direct memory for the PME. The indMduel 
PME memory can be alt local or divided into local and shared global areas programmsiically, Specialized 

10 controls and capabilities which we describe permit rapid task swtiching and retention of program state 
tofonnatton at each of the PMEs interrupt execution levels- Although some of the capabilities we provide 
have existed in other processors, their application for management of intercessor W> is unique in 
masstor/ parallel machines. An example is the integrate of the message router function into the PME itself. 
This eliminates spectaflaed router chips or development of specialized VLSI routers. We also recognize that 

13 in some instances one could distribute the function b we provide on a single chip onto several chips 
Interconnected by metaJizaUon layers or otherwise and accomplish Improvements to massively parallel 
machines- Further, as our architecture is scalable from a single node to massively parallel supercomputer 
level machines, H is possible to utilize some of our concepts at different levels. As wo will illustrate for 
example our PME data (low Is very powerful, and yet operates to make (he scalable design eKective, 

20 The PME processing memory element develops for each of the mufiipl* PMEs of a node, a fully 
distributed architecture. Every PME wftl be comprised of processing capability with 16 bit data flow, 6*K 
bytes of local storage, store end forward/circuit switch logic. PME to PME communication, SlMD/MlMD 
switching capabilities, programmable routing, and dedicated floating point assist logic These functions can 
be independently operated by the PME and integrated with other PMEs within the same chip to minimize 

29 chip crossing delays- Referring to FIOURE$ 7 and 8 we Illustrate the PME dataflow. The PME consists of 
16 bit wpcte dataflow 425. 435. 445, 455, 465. 32K by 16 btf memory 420, specialized L'O porfe 400, 410, 
480, 490 and I/O switching paths 425, plue the necessary control logic to permit the PME to fetch, decode 
and e*ecute a 16 btt reduced instruction set 430. 440, 450, 480. The special logic also permits the PME to 
perlorrn as both the processing unit 460 and data router. Specialized controls 406, 406, 407. 408 and 

30 capabilities are Incorporated to permit rapid task switching and retention of program state Information at 
each of the PMEs" interrupt execution levels. Such capabilities have been included in other processors; 
however, their application specifically for management of fnterprocessor I/O is unique in massively parallel 
machines. 6peclficaily« it permits the integration of the router function inbo (he PME without requiring 
specialized chips or VLSI development macros. 

3S 

16 bit internet data flog and control 

The major parts of the internal data flow or tho processing element are shown in FIGURE 7. FIGURE 7 
illustrates the internal data flow of the processing element. This processing element has a full 18 bit internal 

40 data flow 425. 435, 445. 455, 465, The important paths or the Interna* data flows use 12 nanosecond hard 
registers such as (ha OP register 460, M register 440, WR register 470, and the program counter PC 
register 430- Tnese registers feed the tuny distributed ALU 460 and I/O router registers and logic 405, 400, 
407, 400 for all operations. With current VLSI technology, the processor can execute memory operations 
and instruction steps at 25 Mhz, and it can build the important elements. OP register 450 4 M register 440, 

4* WR register 470> and the PC register 430 wtth 12 nanosecond hard registers. Other retired registers are 
mapped to memory bcatrons. 

As seen in FIGURE 8 lha internal data Bow of the PME has its 32K by 18 bit main store In the form ot 
two DRAM macros. The remainder of the data trow consists of CMOS gate array macros. All of the memory 
can be formed with the logic wfth low power CMOS ORAM deposition techniques to form an very Iarg9 

so scaled Integrated PME chip node. The PME Is replicated 8 times In the preferred embodiment of the node 
chip. The PME data flow oonstets of a 16 word by 1B bit genera? register stack, a muffi-taicnon 
arfthmetioloDjc unit (ALU) working registers to buffer memory addresses, memory output registers, ALU 
output registers, operation/commend, I/O output registers, and multiplexers to select inputs to the ALU and 
registers. Current CMOS VLSI technology for 4MByte DRAM memory with our logic permits a PME to 

55 execute instruction steps at 25Mhz. W* am providing the OP register, (he M register, the WR register and 
the general register stack with 12 nanosecond hard register Other required registers are mapped to 
memory locations wifftin a PME. 
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The PME data flow is designed as a 10 bit integer arithmetic processor. Specie! multiplexor pzrihs have 
been added to optimize subroutine emulation of n x 18 bit floating point operation* (n*>1). The 16 bit data 
flow permffe effect emulation of floating point operations. Specific paths within the data flow have been 
included to permit floating point operations in as little as 10 cycles- The ISA includes special cods point to 

s permit subroutines lor extended (longer than 16-bit) operand operations. The subsequent floating point 
performance is approximately one twentieth the fixed fleeting point performance This performance is 
adequate to eliminate the need for special floating point chips augmenting the PME as is characteristic a? 
other massively parallel machines. Some other processors do Include the special floating point processors 
on the same chip as a single processor (See FIGURE 1). We can enable special floating point hardware 

jo processors on the same chip with our PMEs but we would now need additional cells than is required for the 
preferred embodiment. For floating point operations, see also the concurrently filed FLOATINO POINT 
application referenced above for improvements to the IEEE standard. 

The approach developed is wall poised Lo take advantage or the normal increases in VLSI technology 
performance As circuit sfce shrinks and greater packing density becomes possible then data flow 

53 elements like base and index registers, currently mapped to memory could be moved to hardware. 
Likewise* floating point sub-steps are accelerated with additional hardware which wo win prefer for the 
developing CMOS DRAM technology as reliable higher density levels are achieved. Very importantly, W$ 
hardware alternative does not affect any software. 

The PME is initialized to SIMD mods wlih interrupts disabled. Commends are fed Into the PME 

20 operation decode buffer from the BCI. Each time an instruction operation completes, the PME requests a 
new command from the B<X in a similar manner, immediate data is requested from the BCI at the 
appropriate point In the Instruction execution cycle. Most instructions of the ISA operate Identically whether 
the PME Is (n SIMD mode or in MIMD mode, with the exception of that SIMD instructions and Immediate 
data are taken from the 6CI: in MIMD mode the PME maintains a program counter [PC} and uses that as 

29 the address within own memory to fetch a 19 bit instruction, instructions such as "Branch** which 
explicitly address the program counter have no meaning in SIMD mode and some of those code points are 
reinterpreted to perform special SIMD functions as comparing Immediate data against an area of main store 
The PME instruction decode logic permits either SIMD/WIMD operational rnodes. and PMEs can 
transition between mooes dynamically. In SIMD mode the PME receives decoded instruction information 

so and executes thai data in the next clock cycle. In MIMD mode the PME maintains a program counter PC 
address and uses that as the address within rfs own memory to fetch a 1G brr instruction. Instruction decode 
and execution proceeds as in most any other RISC type machine. A PME in SiMD mode enters MIMD 
mode when given the Information thai represents a decode branch. A PME in MIMD mode enters [he SIMD 
mode upon executing a specific instruction for the transition. 

03 When PMEs transition dynamically beiwoen SIMD and MIMD rnodes, an MIMD mode is entered by 
execution of a SIMD "write contiol register instruction with ihe appropriate control Wt set to a H At (he 
completion of the SiMD instruction, the pme enters the MIMD mode, enables interrupts, and begins 
fetching and executing its MIMD instructions from the main store location specified by its general register 
RO. Interrupts are masked or unmasked depending on the state of interrupt masks when the MIMD control 

& bit Is set. The PME returns to SIMD mode either by being externally reinitialised or by executing a MIMD 
"write control register" instruction with the appropriate control bit set to aero. 

Data communication paths sno control 

w Returning to Figure 7 it will be seen that each PME has 3 incut ports 400, and 3 output ports 430 
intended for on-chip communication plus 1 \K> port 410, 450 for off chip communications. Existing 
technology „ rather than (he processor idea, requires that the off-chip port be byte wide half duplex. Input 
ports are connected such that data may be routed from input to memory, or from input AR register 406 to 
output register 408 via direct 16 bd data pain 425, Memory would be the data sink for messages targeted ai 

so the PME or for messages that were moved in 'store and forward' mode. Messages that do not target the 
particular PME are seitf directly to the required output port, providing a 'circuit switched' mode, when 
blocking has not occurred. The PME S.V7 Is charged with performing the routing and determining the 
selected transmission mode. This makes dynamically selecting between 'circuited switched' and 'store end 
forward' modes possibls. Tnis is also another unique characteristic of the PME design. 

ss Thus, our preferred node has 8 PMEe and each PME has 4 output ports (left, Right Vertical, and 
External Three of the input ports ond three of the output ports are 16-bit wide full duplex poinHo-point 
connections to the other PMEs on the chip. The fourth ports are combined in the preferred embodiment to 
provide a half duplex point-to-point connection to an off-chip PME. Due to pin and power constraints (hat we 



22 



EP 0 570 729 A2 



have imposed to make use of the less dense CMOS we employ, the actual of r-chip interface is a byte-wide 
path which to used to multiplex two halves of the Inter-PME data word. With special "zipper* 1 circuitry which 
provides a dynamic, temporary logical breaking of hnermodai rings to allow data to enter or leave an array, 
theee external PME ports provide the APAP oxremat I/O array function, 

s For data routed to the PME memory, norma) DMA is supported such that (he PME Instruction stream 
must beoorne involved in the VX> processing only at the beginning and end 01 message. Finally, data thai re 
being 'circuit switched to an Internal output port Is forwarded without clocking. This permits single cycle 
data transfers within a chip and detects when chip crossings will occur such that the fastest but still reliable 
communication can occur. Fast forwarding utilizes forward date paths and backward control paths, ell 

io operating in transparent mode- In essence, a source looks through several stages to see the acknowledg- 
ments from the PME performing a DMA or cffrcWp transfer. 

As seen by FIGURES 7 and 8 Data on a PME input port may be destined for the local PME, or for a 
PME further down the ring. Data destined for a PME further down the ring may be stored in the locai PME 
mam memory and then forwarded by the local PME towards the target PME {store and forward}, or me local 

/s input perl may be logically connected to a particular local outpiK port (circuit switched) such that the data 
passes "transparently"' through Uie local PME on its way to the target PME. Local PME software 
dynamically controls whether or not the local PME is in "store and forward 1 ' mode or in ''circuit switched 1 ' 
mode tor any of the four inputs and four outputs, in circuit switched mode, the PME concurrently processes 
all functions except the I/O associated wHh the circuit switch; In store and forward mode the PME suspends 

20 all other processing functions to begin the 00 forwarding process. 

White data may be stored externaiy of the array in a shared memory or DA3D (with external controller), 
it may be stored anywhere in the memories provided by PMEs. input data destined lor the local PME or 
buffered In the local PME during "store and forward - operations is placed into local PME main memory via 
a direct memory access (address) mechanism associated wtlh each of the input ports. A program interrupt 

20 Is available to Indicate that a message has been loaded into PME main memory, "me local PME program 
interprets header data to determine it the data destined for the local PME is a control message which can 
be used to set up a circuits witched path to another PME. or whether it is a message to be forwarded to 
another PME Circuit switched paths are controlled by local PME software. A circuit switched path logically 
couples a PME Input paah directly to an output path without passing through any intervening buffer storage. 

3Q Since tlie ouiput paths between PMEs on the same chip have no intervening butter storage, data can enter 
the chip, pass through a number of PMEs on the chip and be loaded into a target PMEs main memory in a 
single clock cycle! Only when a circuit switch combination leaves the chip. Is en Intermediate buffer storage 
required. This reduces fire effective diameter of the APAP array by a number of unbuffered circuit switched 
paths. As a result data can be sent from a PME to a target PME In as few clock cycles as there are 

as intervening chips, regardless of the number of PMEs in the path. This kind of routing can be compared tb a 
switched environment In which at each node cycles are required to carry data on to the next node. Each of 
our nodes has 8 PMEs! 

Memory and Interrupt Levels 

40 

The PME contains 32K by 16 bit -120 dedicated storage words. This storage is completely general and 
can contain both data and program. In SiMD operations aJl of memory could be data as is characteristic of 
other SIMD massively parallel machines, (o MIMD modes, me memory is quite normal; but, unlike most 
massively parallel MIMD machines the memory is on the same chip with the PME and is thus, immediately 

is available. This then eliminates the need tor cache-tag and cache coherency techniques characteristic of 
other massively parallel MIMD machines. In the case for instance of tnmos's chip, only 4K resides on the 
chip, and external memory Interface bus and pins are required. These are etimmaied by us. 

Low order storage locations are used to provide a set of general purpose registers tor each interrupt 
level. The particular ISA developed tor the PME wee short address fields for these register references. 

so Interrupts are utilized to manage processing, I/O activities and SW specified functions (Le.. a PME in 
normal processing will switch to an interrupt level when incoming iso initiates), if the level is not masked, 
the switch is made by changing a pointer In H/W such that registers are accessed from a new section of 
low order memory and by swapping a single PC value. This technique permits fast level switching and 
permits &W to avoid the normal register save operations and also to save status within the interrupt level 

ss registers. 

The PME processor operates on one of eight program interrupt levels. Memory addressing permits a 
partitioning of the lower 676 words of memory amoung the eight levels of interrupts, 84 of these 678 words 
of memory are directly addressable by programs executing on any of the eight levels. The other 512 words 
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are pertrnoned into sight 04 word segments. Each 64 word segment is directly accessible only by programs 
executing on 11b associated Interrupt level. Indirect addressing techniques are employed tor allowing ail 
programs to access all 32K words or PME memory. 

The interrupt levels ere assigned to the input ports, the BQ, and to error handling. There is a "normal" 
s level but there 19 no "privileged", nor "supervisor'* leveL A program interrupt causes a context switch In 
which the oontenft ol the PC program counter, atafus/CDntiot register, and selected general registers are 
stored in specified main memory locations and new values for these regbrars are fetched from other 
speeded main memory locations. 

The PME data flow discussed with reference to FIGURES 7 and 8, may be amplified by reference to 
jo the additional sections below. In a complex system, the PME data flow uses the combination of the chip as 
an array node with memory, processor and VO which sends and receives messages with the BCl that we 
replicate as the basic budding block of an MMP builr with our APAP, Th3 MMP can handle many wo?d 
lengths. 

« PME Multiple length Data Fiovt Procgsang 

The system we describe can perform the operations handled by current processors w'tfi the data flow In 
the PME which is 10 bits wide. This is done by performing operations on data lengths which are multiples 
of 16 blis. This Is accomplished by doing the operation In 19 bit pieces. One may need to know the resuli 
to of each piece (i.e. was it z&o, was there a carry out of the high bits of the sum). 

Adding two numbers ol 46 bite can be an example of dae flow, m this example two numbers of 46 bits 
{ a(0-47> and bfO-47) ) are added by performing the following in the hardware: 

a<32-47) + rX32-47)->en8<32-47> step one 

29 

t) save the carry out of high bit ol sum 

2) lemember If partial result was zero or non-zero 

a> a(16-01 } * bf 1 6-31 ) + save carry->an s{1 6*31) step two 



1 ) save the carry out of high bit ol sum 

2) remember if partial result was zero or non-zero from this result and from previous step; if both are 
as zero remember *ero; if either is non-zero remember non-zero 

a(O-15)*b(0-l5;*saved car*y->ens<0-15) final step 



to 1) rf this piece fs zero and iasi piece was zero ana is zero 

2) if (his piece is zero and last piece was non-zero ans is non-zero 

3) if this piece is non-zero ans is positive or negative based on sign of sum {assuming no overflow) 

4) if carry into sign of ans os not-equal to cany out of sign of answer, ans has wrong sign and result Is 
an overflow (can not properly represent in the available bits) 

49 The length can be extended by repeating me second step in the middle as many times as required, if the 
length wvre 32 the second step would not be performed. If the lengih were greater than 48, stop two would 
be done multiple limes. If (he length were just 16 the operation in step one, with conditions 3 and 4 of the 
final step vrould be done. Extending the length of the operands to multiple lengths of the date Sow is a 
technique having a consequence that the instruction usually takes longer to execute for a narrower data 

so flow. Thai Is, a 32 bit add on a 32 bit data flow only lakes one pass through (ha adder logic, while the same 
add on a 1 6 bit data now takes two passes through the adder logic. 

What we have done that Is interesting fs that in the current Implementation of the machine we have 
single instructions which can perform axtds/subtrac^comparesmicves on operands of length 1 to 9 words 
(length is defined as part o* the instruction). Individual instructions available to the programmer perform the 

S3 same kind of operations as shown above tor steps one, two, and final (except to the programmer Ihe 
operand length is longer le. 10 to 128 bits). At the bare bones hardware level we are working on ie Ms at 
a time, but ths programmer thinks a/he's doing 10 to 128 bits at a time. 
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By using combinations of these instructions, operands of any length can be manipulated by the 
programmer Le. two instructions can be used 10 add two numbers of up to 263 bits In length. 

PME Processor 

6 

Our PME processor i3 different from modern microprocessors currently utilized for MPP applications. 
Th3 processor portion differences include: 

1. the processor Is a fuiy programmable hardwired computer (see the ISA description for en Instruction 
set overview) witte 

w o it has a complete memory module shown in the upper right comer {see FIGURE 8), 

o it has hardware registers with controls required to emulate separate register sets to each Interrupt 

level (shown in upper bit corner), 
o iis ALU has the required registers and controls to permit effective mulli-cycle integer and floating 

point arithmetic, 

;s o ft has I/O switching paths needed to support pecksi or circuit switched data movement between 

PMEs Interconnected by point-to-point links shown In lha lower right comer. 

2. This is our minimeHst approach to processor design permitting eight replications of the PME per cltfp 
with the CMOS ORAM technology. 

3. This processor portion of the PME provides about the minimum dataflow width required to encode a 
so fas? Instruction Set Architecture -see Tables - which is required to permit effective Ml MO or SIMD 

operation of our MM P. 

PME Resident Software 

20 The PME Is the smallest element of the APAP capable of executing a stored program I? can execute a 
program whfch is resident in some external control etemerri and fed to it by the broadcast and control 
interface iBCI) in SiMO mode or it can execute a program which to resident in Its own main memory <MIMD 
mode), can dynamically switch between SIMD mode and MBtfD mode representing simd/MIMD mode 
dually functions, and also the system can execute these duai&es as the same time tSIMJMD mode). A 
3Q particular PME can make this dynamic switch by merely setting or resetting a bil In a control register. Since 
SIMD PME software ts actually resident in the external control element, further discussion of this may be 
found m our discussion of the Array Director and in related applications, 

MIMD software is stored into the PME main memory while the PME ia in 6IMD mode. This la Feasible 
since many of th8 PMEs will contain identical programs because they will be processing similar data in an 
35 asynchronous manner. Here we would note that these programs are not fixed, but ihey can be modified by 
loading the MIMD program from an external source during processing of other operations. 

Since the PME instruction set archriecture represented in the Tables is ihai of a microcomputer, there 
are few restrictions with this architecture on the functions which the PME can execute. Essentially each 
PME can function like a RISC microprocessor. Typical MIMD PME software routines are listed below: 
40 1 . Basic control programs for dispatching and prioritizing the various resident routines. 

2. Communication software Jo pass data and control messages among the PMEs. This software would 
determine when a particular PME would go into/out of the "circuit switched 1 * mode, ft performs a "store 
and toward" function as appropriate. It also initiates, sends, receives, and terminates messages between 
its own main memory and that of another PME. 
« 3. Interrupt handling software completes the oontext switch, and responds to an event which has caused 
tie interrupt. These can indude fail-safe routines and rerouting or reassignment or PMEs to an array . 
A. Routines which implement the extended instruction Set Architecture which we describe below. These 
routings perform macro level instructions such as extended precision fixed point arithmetic, floating point 
anthmetic. vector arithmetic, and the like. This permits not only complex maih to be handled but image 
so processing activities for display of image data in multiple dimensions {2d and 3d images) and multimedia 
processes. 

5, Standard mathematical fibrary functions can be included. These can preferably Indude UNPAK and 
VPSS routines. Since each pme may be operating on a different element of a vector or matrix, the 
various PMEs may all be executing different routines or differing portions os the same matrix et one time, 
55 6. Specialized routines for performing acatier.fyither or sorting functions which take advantage of the 
APAP nodal interconnection structure end permit dynamic mufl^mensionei routing are provided. The 
routines effectively tafce advantage of some amount of synchronization provided among the various 
PMEs, white permitting asynchronous operations to continue. For aorta, there are sort routines. The 
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APAP is w*ll suited to a Batcher Sort. Because that sort requires extensive calculations to determine 
particular element to compare versus very short comparison cycles. Program synchronization to man- 
aged by Ihe I/O statements. The program allows multiple data etemenfc per PME and very large parallel 
sorts in quite a straight forward manner, 
s While each PME lias Its own resident software, Ihe systems made from these microcomputers can 
execute higher level language processes designed toi scalar and parallel machines. Thus the system can 
execute application programs which have been written tor UNIX machines, or those of other operating 
systems, in high level languages such as Fortran. FortranD. and so on. 

ft may be an Interesting footnote mat our processor concepts use an approach to processor design 
jo which is quite old. Perhaps thirty yeans of use of a similar ISA design has occurred in IBM's military 
processors. We have been the first to recognize that this kind of design can be used to advantage to 
leapfrog the dead ended current modem microprocessor design when combined with our total PME design 
to move ihe technology to a new path tor use in trie next century. 

Although the processor's design characterlstice are quite different from other modern microprocessors, 
>5 similar gsie constrained military and aerospace processors have used the design since the '60s. It provides 
sufficient instructions and registers tor straight forward compiler development and both general and signal 
processing applications are effectively running on tills design. Cur design has minimal gate requirements, 
and EM has Implemented some similar concepts tor years when embedded chip designs were needed 
general purpose processing. Our adoption now of parts of ihe older ISA design permits use of many utilities 
and other software vehicles which will enable adoption of our systems at a rapid rate because of the 
existing base and the knowledge ihai many programmers have of ths design concepts. 

PME t/Q 

a The PME will interface to the broadcast and control Interface (BCl) bus by etther reading data from the 
bus into me ALU via the path labeled BCl in FIGURE 8 or by torching msfrueitons irom the bus directly into 
the decode logic (not shown). The PMG powers up in SiMD mode and in that mode reads, decodes and 
executes instructions until encountering a branch. A broadcast command 3IMD mode causes the transition 
to MIMD with instructions fetched locally. A broadcast PME instruction 'INTERNAL DIOW reveres the state. 

30 PME I/O can be sending data, passing data or receiving data. When sending data, the PME sets the 
CTL register to connect XMIT to eitter L, R, V, or X. HW services fren pass a block of data from memory 
to the target via the ALU multiplexer end the XMIT register. This processing interleaves with normal 
Instruction operation. Depending upon application requirements, the block of data transmitted can contain 
raw data for a predefined PME anoVor commands to establish paths. A PME that receives data will store 

as input to memory and interrupt the active lower level processing. The interpretation task et the interrupt level 
can use the Interrupt event to do task synchronization or Initiate a transparent I/O operation (when data is 
addressed elsewhere.) During the transparent I/O operation, tos PME is tree to continue execution- Its CTL 
register makes it a bridge. Data will pass through il without gating, and it will remain in thai mode until an 
Instruction or the data stream <esets CTL While a PME is passing data It cannot be a data source, but it 

40 can be a data sink for another message. 

PME Broadcast Section 

This is a chip-to-common control device interface. It can be used by the device that servas as a 
<w controller to oornmand I/O, or test and diagnose the complete chip, 

tnput is word sequences (either instruction or data) that are available to subsets of PME a- Associated 
wHh each word is a code Indicating which PMEc wit! use the word. The the BCl will use the word both to 
limit a PMFe access to fre interface and to assure that ell required pme$ receive data. This serves to 
adjust the BCl to the asynchronous PME operations. (Even when in SIMD mode PMEs are asynchronous 
so due to I/O and interrupt processing.) The mechanism permits PMEs to be formed into groups which are 
controlled by interleaved sets of command/d8ta words received over the BCl. 

Besides delivering data to the PMEs, the BCl accepts request codes from the PME combines them, 
and transmits Ihe Integrated request. This mechanism can be used in several ways. MlMD processes can 
be inflated In a group of processors mat all end wrth an output signal, The 'AND* of signals triggers the 
69 controller to initiate a new process. Applications, in many cases, will not be able to toad ail required 3AW in 
PME memory. Encoded request to the controller win be used to acouire a SAW overlay from perhaps the 
host's storage system. 
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Ths controller uses a serial scan loop through many chips to acquire information on individual chips or 
PMEb. "mesa loops Initially Interface to the BCl but can in (he BCl be bridged to Individual PMEs. 

Broadcast and Control Interfere 

5 

The BCl broadcast and control interface provided on each chip provides a paraSei input interlace such 
that data or instructions can be sent to ths node. Incoming data is tagged with a subset identifier and the 
BCl Includes the functions required to assure that all PMEs within the node, opeietktg within the gub3et» are 
provided the data or Instruction*. The parallel interface of the SCI serves bom as a post to permit data to be 
io broadcast to all PMEs and as the instruction interface during SIMD operations. Satisfying both requirements 
plus extending those requirements to supporting subset operations is completely unique to this design 
approach. 

Our DCl parallel input interlace permits data or instructions to be sent from a control element that is 
external to the node. Tne BCl contains "group assignmenT registers (see the grouping concepts In our 

;s above application entitled GROUPING OF SIMD PICKETS) which are associated with each of the PMEs. 
Incoming data words are tagged with a group identifier and the BCl Includes the functions required to 
assure that all PMEs within the node which are assigned to the dedicated group are provided the data or 
instructions. The paraJtel interface of the BCl serves as both a port to permit data to be broadcast to the 
PMEs during MIMD operations, and as the PME InatrucUon/Jmmedlate operand Interface during SIMD 

so operations. 

The 6CI also provides two serial interfaces. The high spsed serial port will provide each PME with the 
capability to output a limited amount of status information. That data is intended to: 

1. signal our Array Director 610 when the PME, e.g, 500, has data mat needs to be read or that the PME 
has completed some operation. II passes a message to the external control element represented by the 

20 Array Director. 

2. provide activity status such that external test and monitor elements can illustrate the aiatu* oi me 
entire system, 

The standard serial port permits trie external control element means for selectively accessing a specie 
PME for monitor and control purposes. Data paa&ed over this interface can direct data from the BCl parallel 

30 Interface to a particular PME register or can select data from a particular PME register and route it to the 
Wfih speed serieJ port. These control points allow the external control element to monitor and control 
individual PMEs during initial power up and diagnostic phases. It permits Array Director to input control dara 
so as to direct the port io particular PME and node Internal registers and access points. These registers 
provide paths such mat PME of a node can output data to the Array Director, and these registers permit the 

36 Array Director to input data to the units during initial power up and dagnostc phases. Data input to access 
point can be used to control test and diagnostic operations* ie. perform single instruction slap, stop on 
compare, break points, etc. 

Node Topology 

Our modified hypercube topology connection is most useful for massively parallel systems, while other 
less powerful connections can be used with our basic PMEs* Within our initial embodiment of the VLSI chip 
are sight PMEs with fu*iy distributed PME Internal hardware connections. The Internal PME to PME chip 
configuration is a two rings of four PMEs, with each PME also having one connection to a PME in the other 

a ring. For the case of eight PMEs in a vlsi chip this is a three dimensional binary hypercube, however our 
approach in general does not use hypercube organizations wrihin the chip. Each PME also provides for the 
escape of one bus. In the initial embodiment the escaped buses term one ring are called +K +Y, +W and 
+Z> while those from the after ring ere labeled similarly except - (minus). 

The specific chip organization is referred to as me node of fte array, and a node can be in a cluster of 

30 the array. The nodes are connected using +-X and +-Y Into an array » to create a cluster. The 
dimensionality of tfid array is arbitrary, and in general greater man two which is the condition required for 
developing a binary hypercube. The clusters are then connected using *-W, *-Z Into a array of clusters. 
Again, the dimensionality of the array Is arbitrary. The result Is lie 4-dlmensional hypeicube of nodes. The 
extension to a 5-dimensionai hypercube requires the usage of a 10 PME node and usee the additional two 

58 buses, say +-EI to connect the 4-dimcnsiona! hypercube into a vector of hypercubos. We have (hen shown 
the pattern of extension to either odd or even radix hypercubes. This modified topology limits the cluster to 
cluster wiring while supporting me advantages of ths hypercube connection. 



27 



EP 0 570 729 A2 



Our v/ireabifrty and topology ccnfiguranon for massivety parallel machines has advantages in keeping 
(he X and Y dimensions wiihln our duster level of packaging, and in distributing (he W and Z bus 
connections to all the neighboring dusters. After implementing the techniques described* the product will be 
wiraablo, and manufecturabts white maintaining the inherent characteristics of the topology defined, 

s The node consists of k ' n PMEs plus the Broadcast and Control Interface (BCl> section. Here VT 
represents the number of dimensions or rings, which characterize the rnoGlfied hypercube. while 'V 
represents the number of rings that characterize the node. Although a node can contain N rings h is a 
characteristic of the concept that only two of those rings may provide escape buses. *n* and *k w 1$ limited 
in our pn3f erred embodiment, because ot the physical chip package to N«4 and k=2. This Hmftation is a 

m physical one, and different chips sals will allow other and increased dimensionality in the array. In addition 
to being a pan of the physical chip package, it is our prefen-ed embodiment to provide a grouping of PMEs 
thai interconnect a set of rings in a modified hypercube. Each node will ha«e 8 PMEs with their PME 
archiJecture and ability to perform processing and data router functions, fts such, n is the dimensfonality ol 
the modified hypercube (see following section), l.e., a 4d modified hypercube's node element would be $ 

« PMEs while a 5d modified hypercube's node would be 10 PMEs. For visualization of nodes which we can 
employ, refer to FIGURE 6 t as well as FIGURES 9 and 10 for visualization ol Interconnections and see 
FIGURE 11 for a block diagram of each node. FIGURES 15 and 16 elaborate on possible inter collections 
for an APAP. 

It will be noted thai the application entitled "METHOD FOR INTERCONNECTING AND SYSTEM OF 
20 INTERCONNECTED PROCESSING ELEMENTS 1 " Of OO-iitvetfoi David B. Rclfe, filed in the United StSt*3 
Parent and Trademark office on May 15, 1991, under USSN 0*77698.368, described the modified hypercube 
criteria which can preferably be used in connection with our APAP MMP. 

That application is incorporated by reference and describes a method of interconnecting processing 
elements in such a way that the number of connections per element can be balanced against the network 

29 diameter (worst case path length). This Is done by creating a topology that maintains many of the well 
known and desirable topological prcperSee cf hypercubes while improving rts flexibility by enumerating the 
nodes ol the network in number systems whose base can be varied. When using a base 2 number system 
this method creates the hypercube topology. The invention has fewer Interconnections than a hypercube, 
uniform connections and preserves the properties of a hypercube. These properties include: 1) large 

30 number of alternate paths* 2) very high aggregate bandwidth, and 3) well understood and existing methods 
that can be used to map other common problem topologies with the topology of the network- The result is a 
general bed non-binary hypercube with less density, (t will be understood that with the preference we have 
given to the modified hypercube approach. In some applications a conventional hypercube can be utilized. 
In connecting nodes, other approaches to a topology could be used; however, the ones we describe herein 

33 are bebeved to be new and an advance, and we prefer the ones we describe. 

The interconnection methods for the modified hypercube topology for Interconnecting a plurality of 
nodes in a network of PMEs: 

1, defines a sets of integers et, e2, e3. „. such the product of all elements equals the number of PMEs 
in the network called M while the product of all elements in the set excepting e1 and e2 is the number 

^0 ot nodes called N, and the number of elements in the set caiied m dennes ihe dimensionality of the 
network n by the relationship n = n>2, 

2, addresses a PME located by a set of indexes al. a2> ... am. where each index Is the PMEs position in 
the equivalent fcvel of expansion such that the lode* al Is In the rsnge of zero to ei-1 for J equal to 1. 2, 
torn., by the formula 

(...(a<m)"e(nv1) * a (rrr2))e{m-1) ... 8^e(l)>+e(1) 

wtere the notation a© has ihe normal meaning of the the fth element in the tea of elements caJled a* or 
squivaJently for e- 

50 3. connects two PMEs {with addresses f and g> If and only if eRher of the following two conditions hold: 

a, the integer part of rv'(ei * e2) equals the integer part ot e/(ei * e2) and: 

1. the remainder part of r/el differs from the remainder pari of s^t by 1 or, 

2. the remainder part ot rfe2 differs from the remainder pan of s/e2 by 1 or e2-i. 

b, me remainder part of r/ei differs from the remainder part of s/e> for I in the range 3. 4. ... m and the 

so remainder part of r/el equals the remainder part of Q.fe2 which equals i minus three, and the 

remainder port of r/e2 differo from the remainder part of s/e2 by e2 minus one. 
As a result the computing system nodes will form a non-binary hypercube, with the potential for being 
different radix in each dimension. The node is defined as an array of PMEs which supports 2*n ports such 
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that the ports provided by nodes match the dimensionality requirements of the modified hypercube, the 
set of integers e3. e4. em. which deHne the specific extent of each dimension of a particular modified 
hypercube are an taken as equal, say b, and if 01 and e2 are taken a 1, then the previous formulas for 
addressability and connections reduce to: 
5 1.N=b"*n 

2. addressing a PME as numbers representing the base b numbeitng system 

3. connecting two computing elements (i and g) if and only if me address of f differs from the address of 
g in exactly one base b digit using the rule that 0 and b-1 are separated by 1 - 

4. the number of connections supported by each PME is 2"n 

10 Which is exactly as described in the base application, with (he number of communication buses spanning 
noo-edfacem PMEs chosen as zero. 

Intra-Node PME Interconnections: 

;s PMEs are configured within the node as a 2 by n array. Each PME is interconnected with te three 
neighbors (edges wrap only in the eacond dimension) using a eel of input/ouiput ports, thus, providing full 
duplex communication capably between PMEs- Each PMEs external input and output port ts connected to 
node hO pins. Input end output porks may be connected to share pins for haJf-dupiex communication or to 
separate pins for full-duple* capability. The interconnections for a 4d modified hypercube node are shown 

20 in figures 9 and 10. (Note that where n is even the node can be considered to be a 2 by 2 by n/2 array.) 
FIGURE 9 iilustraros the the eight processing etemems 500, 310. 520, 630, 540, 560, 560, 570 within 
the node. The PMEs are connected in a binary hypercube communication network. This binary hypercube 
displays three intra connections between PMEs (501 , 51 1 521, 531, S4i, 551, 561, 571, 590, 591, 592, $93}, 
Communication between the PME is controlled by in and out registers under control of the processing 

2$ element. This diagram shows the various paths that can be taken to escape I/O out any of the eight 
directions, +-w 525, 505. 515. 555. +-y 505. 545. +. 2 535, 575. The communication can be 
accomplished without storing the data inio memory if desired. 

It may be noted that whBe a network switch chip could be employed to connect various cards each 
having our chip with other chips of me system, it can and should desirably be eliminated. Our inter PME 

30 network thai we describe as (he "*d torus" is the mechanism used for Inter PME-communlcation. A PME 
can reach any other PME in tfie array on this interface. (PMEs in between may be StorerForward or Circuit 
Switched) 

Chip Relationships for Interconnections 

We have discussed Use chip, and FIGURE 11 shows a block diagram of the pme ProtessorVMemory 
chip. The chip consisis oi me following etomeitis each of which will be described in later paragraphs: 
1. 8 PMEs each consisting of a 16 bit programmable processor and 32K words of memory (6*K bytes), 

2. Broadcast interface (BCl) to permit a controller to operate all or subsets of the PMEs and to 
40 accumulate PME requests, 

3. Interconnection Levels 

a. Each PME supports four 8 bit wide trrter-PME communication paths. Tnese connect to 3 
neighboring PMEs on the ch)p and 1 off chip PME. 

b, BroadcasMo-PME busing . which makes data or instructions available. 

4$ c. Service Request lines that permit any pme to send a code to the controller. The bci combines the 
requests and forwards a summery, 

d. Serial 6ervloe loops permit the controller to read aO detail about the functional blocks. This level of 
inteiccmnection extends from the BCI to ail PMEs {FIGURE 1 1 for ease of presentation omits this 
detail-) 

so 

mterconnecnon and Routing 

Tne MPP wai be Implemented by repftcelcn of the PME. The degree of replication does not affect the 
interconnection and routing schemes used, FIGURE 6 provides an overview of the network interconnection 
55 scheme. The chip contains Q PME3 with interconnections to their immediate neighbors. This interconnection 
pattern results in tte three dimensional cube structure obown in FIGURE 10. Each of the processors within 
the cube has a dedicated bidirectional byte per; to the chip's pins; we refer to the set of 0 PMEs as a node. 
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An n by n array of nodes is o cluster. Simpte bridging between the * and - x ports end the + and • y 
porte provide the cluster node Interconnections, Here the our preferred chip or node has 8 PMEo. each of 
which manages a single external port. This distributes the network control function and eliminates a possible 
bottleneck for ports, Bridging the outer edges makes th© cluster Into a logical torus. We haw considered 
s clusters with n = 4 and n~8 and believe that n =8 Is the better solution for commercial applications while 
n = 4 fs better for military conduction cooled applications. Our concept does not impose an unchangeable 
duster size. On the cornrary, we anticipate some applications using variations- 

An errey of dusters results In the 4 dimensional torus or hypercube structure Illustrated In FIGURE 10. 
Bridging between the * and - w ports and + and - z ports provides the 4d torus interconnections. This 
30 results in each node within a cluster connected to an equivalent node in all adjacent clusters, (This provides 
$4 ports between two adjacent clusters rather than the 8 ports that would resuH from larger clusters.) As 
wnh the duster size, the scheme does not imply a particular size array. We have considered 2x1 arrays 
desirable for workstations and MIL applications and 4x4« 4x8 and 0x8 arrays lor mainframe applications. 

Developing an er?ay of 4d toruses Is beyond the gate* pin, and connector limitations of our current 
>5 preferred chip. However, ihat limitation disappears with our alternative on-chip optical driver/receiver is 
employed. In ihls embodiment our network could use an optical path per PMC; wllh 12 rather than 8 PMEs 
per chip the array of 4d toruses with mufH-Tflop (Teraflop) performance and it seems to be economically 
feasible to make such machines available for the workstation environment. Remember that such alternative 
machines will use the application programs developed for our current preferred embodiment 

4d duster Organkation 

For constructing a 4d modified hypercubs 360, as illustrated in FIGURES 5 and 10, nodes supporting 3 
external ports 31 5 are required. Consider Che external ports to be labeled as +X, +Y. +2, + W, -X, -Y, -2, 

29 -W. Then using m 1 nodes, a ring can be constructed by connecting the +X to -X ports. Again m 2 such 
rings can be interconnected into a ring of rings by irrterconnecting me matching *Yto *Y porta. This level 
of structure will be called a cluster 320. With mi =m2"& it provides tor 51$ PMGb and such a cluster will 
be a building block for several size systems {330, 340. 350). as Illustrated with m=8 in FIGURE $. 

30 4d Array Organisation 

For building large finegrained systems, sets of m3 clusters are interconnected in rows using the + 2 to 
■Z ports. The rm rows are then interconnected using the + W to -W perls. For mi «...rm *8 this results In 
system with 32768 or PMEe, The organization does not require that every dimension be equal y 

as populated as shown in FIGURE 6 (large fine-grained parallel processor 370). In the case of the fine-grained 
small processor, only e cluster might be used with the unused Z end W ports being Interconnected on the 
card. This technique saves card connector pins and makes possible the application cri this scalable 
processor to workstations 340, 350 and avionics applications 330, both of which are connocior pin limited. 
Connecting +f- ports together In the Z and W pairs leads to a workstation organization that permits debug. 

40 test and large machine software development. 

Again, much smaller scale versions of (he structure can be developed by generating the structure with a 
value smaller Shan m = 8. This will permit construction of single card processors compatible with the 
requirements for accetsrators in the PS/2 or RtSC System 6000 workstation 350, 

170 Performance 

I/O performance includes overhead to setup transfers and actaa! burst rate data movement Setup 
overhead depends upon application function ¥0 complexity and network contention. For example, an 
application can program circuit switched traffic vrith buffering to resolve conflicts or it can have all PMEs 
so transmit left and synchronize. In die first case. I/O is a major task and detailed analysis would be used to 
size it We estimate that simple case setup overhead Is 20 to 30 clock cycles or & to 1.2 u-sec. 

Burse rate I'O Is the maximum rat© a PMG can transfer dara (with either an on or off chip neighbor.) 
Memory access limits set the data rate at 140 nsec per byte, corresponding to 7.14 Mbyte/s. This 
performance includes buffer address and count processing pfas data reaoVwrite. it uses seven 40ns cycles 
ss per 18 bit word transferred. 

This burst rate performance corresponds to e cluster having a maximum potential transfer rate of 3.05 
Gbytes/s. li also means that a set ol eight nodes along a row or column of the cluster will achieve 57 
Mbyte/9 burst data rate using one set of their 8 available ports. This number is significant because I/O with 
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ths external world witi be done by logically 'umrtpping' an edge of the wrapped cluster and attaching it to 
the externa! system bus. 

Inter-PME Routing Protocol 

The SIMD/MIMD PME comprises inter processor oommunk:Anon to the external I/O facilrHes. broadest 
corrsrol interfaces, and switching features which aitow both SIMD/MIMD operation whhin the same PME. 
Embedded in the PME Is fte fuBy distributed progrommeble I/O router for processor communication end 
data transfers between PMEs, 

■to The PMEs have fully distributed interprocesser communication hardware to on-chip PMEs as well as to 
the external I/O facilities which connect to the interconnected PMEs in the modified hypercube conffcura- 
tlon- This hardware is complemented with the flexible ryogrammability of the PME to control the 1*0 activity 
via software, The programmable I/O router functions provide for generating data packets and packet 
addresses. With this Information the PME can send the information thru the network of PMEs in a directed 

ss method or out multiple paths determined by any fault tolerance requirements. 

Distributed fault tolerance algorithms or program algorithms can Lake advantage of ihe programmablllty 
along wnh the supported circuit arched modes of the PME. This performance combinational mode 
enables everything from off-line PMEs or optimal path data structures to be accomplished via the 
programmable I/O router, 

no Our study of applications reveals that it is sometimes most efficient to send bare data between PMEs. 
At other times applications require dais and routing information. Further, it is sometimes possible to pian 
communications so (hat netwodc conflicts cannot occur: other applications ofler the potential tor deadlock! 
unless mechanisms for buffering messages at Intermediate nodes are provided. Two examples illustrate me 
extremes. In the relaxation phase of a PDE solution, each grid point can be allocated to a node. The inner 

£S loop process of acquiring data from a neighbor can easily be synchronized over all nodes. Alternatively, 
image transformations use local date parameters to determine communication target or source identifiers. 
Thie resutte in data moves through multiple PMEs, and each PME becomes involved in the routing task for 
each packet. Pi epiennlng such traffic is generally not possible- 

To enable the network to be efficient tor all types of transfer requirements, we partition, between the 

30 H/W and 8/W, the responsibility tor data routing between PMEs. S/W does most of the task sequencing 
function. We added special features to the hardware (H/W) to do the inner loop transfers and minimize 
software <8/W) overhead on the outer loops, 

I/O programs at dedicated interrupt levels manage the network. For most applications, a PME dedicates 
tour interrupt levels to receiving data from the four neighbors. We open a buffer at each level by loading 

3s registers at the level, and executing the IN (rx uses buffer address and transfer count but does not await 
data receipt) and return Instruction pair Hardware then accepts words from (he particular Input bus and 
stores them to the buffer. The buffer full condition will torn generate the interrupt and restore tne program 
counter to ihe instruction after the RETURN, This approach to interrupt levels permits I/O programs to be 
written that do not need to test whet caused *e Interrupt Programs reed data, return, end then continue 

40 directly into processing me data they read. Transfer overhead is minimized as most situations require Iritis 
or no register saving. Where an application uses synchronization on I/O, as in (he PDE example, then 
programs can be used to provide that capability. 

Write operations can be started m several ways. For the PDE example, at the point where a result Is to 
be sent to a neighbor, the application level program executes a write call. The call provides butter location^ 

*b word count and target address. The write subroutine includes the register loads and OUT instructions 
needed to initiate the HW and return to the application. H W does the actual byte by byte data transfer. 
More complicated output requirements will use an output service function at the highest mlsrrupL level. Both 
application and interrupt level tasks access that service via a soft interrupt. 

8etting up circuit switched paths builds on these simple read and write operatione. We start wfth all 

so PMEs having open buffers sl2ed to accept packet headers but not the data. A PME needing to send dale 
initiates me transfer by sending an address/data block to a neighboring PME whose address baiter marches 
the target In ihe neighboring PME address information will be stored; due to buffer full an interrupt will 
occur. The interrupt S/W tests the target address and will either extend (he buffer to accept the data or write 
the target address to an output port and set the CTL register for transparent data movement. (This allows 

56 the PME to overlap its application executions with the circuit switched bridging operation.) The CTL register 
goes to busy state and remains transparent until reset by the presence of a signal at end of data stream or 
abnormally by PME programming. Any number variations on these themes can be implemented. 
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System \iO and Array Oirsctor 

FIGURE 12 shows an Array Director In the preferred embodiment, which may perform the functions of 
the controller of FIGURE 13 which describes the system bus to array connections, FIGURE 13 is composed 

s of two parts, (a) (he bus to/from a cluster, and part (b) the communication of information on the bus to/from 
a PME, Loading or unloading the array is done by connecting the edges of dusters to the system bus. 
Multiple sysrem buses can be supported with mufspte clusters. Each duster supports 50 to 57 Mbyte's 
bandwidth. Loading or unloading the parallel array requires moving data between all or a subset of the 
PMEs and standard buses <ie MicroChannel, VM&bue, FufureBue, etc). Those buses, part of the host 

70 processor or array conlroflsr, are assumed to be rigidly specified. The PME Array therefore must be 
adapted to the buses- The pme Array can be matched to the bandwidth of any bus by interleaving quo data 
onio n PMEs, with n picked to permit PMEs both I/O and processing time, FIGURE 13 shows how we might 
connect the system buses to the PMEs ai two edges of a cluster. Such an approach would permit 114 
Mbyte/s to be supported, it also permte data to be loaded at half the peak rate to two edges slmulta* 

>s nsousry. Although this reduces ths bandwidth to 57 Mbyte/Baluster, it has the advantage of providing 
orthogonal data movement with In the array and ability to pass dala between two buses. (We use those 
advantages to provide fast transpose and matrix multiply operation.) 

As shown in part (a) of FIGURE 1$, the bus "dots to all paths on the edges of the cluster; and. the 
controller generates a gate signal to each path In the required Interleave timing. If required to connect to a 

20 system bus wiih greater than 57 Mbyte's, the data will be interleaved over multiple clusters. For example, in 
a system requiring 200 Mbyte/s system buses, groups of 2 or 4 dusters will be used. A large MPP has the 
capacity to attach 16 or 94 such buses to its xy network paths. By using the w end z paths In addition to the 
x and y paths, mat number could be doubled, 

FIGURE 13 pert (b) shows how the data routes to individual PMEs. The FIGURE shows one particular 

29 w>,y or z path that can be operated at 3M3 Mbyte/s in burst mode- If the data on the system bus occurred 
in bursts, and if the PME memory could contain the complete burst, then only one PME would be required. 
We designed the PME t/O structure to require neither of these conditions. Data can be gated into the 
PME*0 at the full rate umfl buffer full occurs. At that instant PMExO will change to transparent and PMBci 
will benjn accepting ft* data. Within PMExO processing of the input data buffer can begin. PMEs that have 

a> taken data and processed It are limited because they cannot transmit ths results while In die transparent 
mode. The design resolves this by switching the data stream to the oppose end of the path at intervals. 
FIGURE 13(b) shows that under SW control one might dedicate PMExO through PMEx3 to accepting data 
while PMEx12 Uvough PMEx15 unload results and visa-versa. The controller counts words and adds end of 
block signals to tl>e data stream, causing the switch in direction. One count applies to ail paths supported 

as by the controller so controller workload is reasonable, 

SYSTEMS FOR ALTERNATIVE COMPUTERS 

F1-3URE 18 illustrates a system block diagram for a host attached large system with a single application 

<so processor interface (API). This iflustratton may also be viewed with me understanding that our invention may 
be employed in stand alone system which use multiple application processor interfaces (not shown) This 
configuration will support DASD/G rchptea on all or many clusters. Workstation accelerators can eliminate the 
host, application processor interface (API) and cluster synchronizer (OS) illustrated by emulation. The CS 
not always required. It will depend on (he type of processing that is being performed, as well as the 

4s physical drive or power provided for a particular application which usee our invention. An application this is 
doing mostly MIMD processing will not place as high a workload demand on the controller, so here the 
control bus can see very alow pulse rise Umea. Conversely, system doing mostly asynchronous A-SIMD 
operations with many independent groupings rney require faster control busing. In this case, a cluster 
synchronizer will be desirabte. 

so The system btocfc diagram of FIGURE 18 Illustrates that a system might consist of host, array controller 
and PME sway. The PME array ia a eel of dusters supported by a set of cluster controllers (CC). Although 
a CC Is shown lor each duster that relationship is not saictfy required. The actual ratio of clusters to CCs Is 
flexible. The CC provides redrlve to» and accumulation from the 84 BCIs/cluslers. Therefore, physical 
parameters can oe used establish the maximum ratio. Additionally, the CC wil) provide for controlling 

ss multiple independent subsets of the PME array; that service might also become a gating requirement A 
study can be mode to detormine these requirements foi any particular application of our invention. Two 
versions of the CC will be used. A cluster that is to be connected to a system bus requires the CC providing 
interleave controls (see System I/O and FIGURE 18) and bi-atate drivers. A more simple version lhai omits 
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the tri-EtEte busing features can also bo employed. In the cas* of large systems, a second stage of redr'rvs 
and accumulation is added. This level is the cluster synchronizer <C&). Hie set of CCs plus CS and the 
Application Processor Interface (API) make up the Ai ray Controller. Only the API is a programmable unit 

8everat variations of this system synthesis scheme will be used. TheB© result in different hardware 
3 configurations Tor various applications, but they do not have a major impact on the supporting software. 

For a workstation accelerator, the duster coolers will be attached directly to the workstation system 
bus; the function of me API will be performed by the workstation. In (he cose err a RtSOBOCO, The system 
bus Is a Micro Channel and the CO units can plug directly into the slots within {he workstation. This 
configuration places the VO devices (OASO, SCSI and display interfaces} on the same bus that 
w loads/unloads the array. As such Ihe parallel array can be used Cor I/O intensive tasks like real time image 
generation or processing. For workstations using other bus systems (VME-bus« FutureBus, e»<)« a gateway 
Interface will be used. Such modules are readily available in the commercial marketplace, Nets that in mess 
minimal scale systems a single CC can be shared between a determined number of clusters, and nefther a 
CS nor an API Is needed 

;s A MIL avionics application might be simitar in size to a workstation, but it needs different interfacing. 
Consider what may become the normal military situation. An existing platform must be enhanced with 
edoitione) processing capability, but funding prohibits a complete processing system redesign. For this we 
would attach to the APAP array a smart memory coprocessor, to this case, a special application program 
interface API that appears to the host as memory will be provided. Data addressed to the host's memory 

so wii then be moved to the array vie CC(s). Subsequent writes to memory can be detected end interpreted as 
commands by ihe API so that the accelerator appears to be a memory mapped coprocessor. 

Large systems can be developed as either host attached or as stand alone configurations. For a host 
attached system, ihe configuraiion shown in FIGURE 18 is useful. The host will be responsible for I/O, and 
Ihs API would serve as a dispatched task manager. However, a large stand alone system is also possible in 

£0 special situations. For example, a database search system might eliminate the host, atiach DA$D to the 
MicroChannel of every cluster and use multiple APIs as bus masters slaved to me PMEs. 

Zipper Array Interface wtth Sternal I/O 

as Our zipper provides a fast VO connection scheme and is accompDshed by placing a switch between two 
nodes of the array. This switch will allow for the parallel communication into and out of the array. The fast 
1-0 will be implemented along one edge or the array rings and acts like a targe zipper into the K Y, W, Z 
rings. The name 1 'zipper connection* is given to the fast I/O. Allowing data to be transferred Into and out of 
ths network while only adding switch delays to transfer the data between processors is a unique loading 

3S technique. The switching scheme does not disrupt ths ring topology created by the X, Y, W, Z buses and 
special support hardware allows the ripper operation to occur while the PE is processing or routing data 

Ths ability to bring data into and out of a massively parallel system rapidly is en important 
enhancement to the performance of Iho overalt syslero.We believe that the way we implement our fast I/O 
without reducing the number of processors or dimension of the array network is unique in the field of 

40 massively parallel environments. 

The modified hyparcube arrangement can be extended to permii a topology which comprises rings 
within rings. To support me interface to the external I/O any or all of the rings can be logically broken. The 
two ends of the broken ring can then be connected to external 1*0 buses. Breaking the rings is s logical 
operation so as to permii regular inter-PME communication at certain time intervals while permitting I/O at 

<s other time intervals* This process of breaking a level of rings within the modified hypercube effectively 
'unzips 1 rings for I/O purposes. The fast I/O "zipper" provides a separate interface into the array. This zipper 
may exist on 1 to n edges of the modified hypercube and could support either parallel Input into multiple 
dimensions of the array or bioadcast to multiple dimensions of the anay. Further data transfers into or out 
of the array could alternate between the two nodes direcity attached to the zipper. This I/O approach is 

so unique and it permits developing different zipper sizes to satisfy particular application requirements. For 
example, in the particular configuration shown m FIGURE 6, called me large fine-grained processor 350, the 
zipper for the Z and W buses will be dotted onto ihe MCA bus. This approach optimizes the matrix 
transposition time, satisfying a particular application requirement for the processor. For e more detailed 
explanation of the "zipper 0 structure, reference may be had to the APAP I/O ZIPPER application tiled filed 

58 concurrently herewith. The aripper b here illustrated by Figure 14 

Depending on the conjuration and me need of the program to roll data and program Into and out of 
the individual processing elements, me size oS the zipper can be varied. The actual speed of me VO zipper 
is approximately the number of rings attached times me PfclE bus width, times the PME clock rate all 
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divided by 2. (The division permits the receiving PME time to move data onward. Since il can send it to any 
of n places I/O oontentton to completely absorbed over the Array.) With existing teohnoiogy* ie_, 5 MB/sec 
PME transfer rato, 64 rings on the zipper, and Interleaved to two nodes transfers, 32Q.MB/sec Array transfer 
rates are possible. (See tho typical ripper configuration in FIGURE 14). FIGURE 1* illustrates the fast VO or 
5 the so-called 'zipper connection* 700, 710 which exist 9 as a separata Interface Into the array. This zipper 
may exist on one 700 or two edges 700, 710 of the hypercube network by dotting onto the broadcast bus 
720, 730, 740, 750, at multiple nodes in the array 751, 752, 753, 754 and In multiple directions 770, 760. 
790,751.755, 757. 

Today's MCA bus supports 60 to 1 60 MB per second buret transfer rate and therefore is a good match 
?o tor a single zipper in simple or non-interleaved mode. The actual transfer rate given channel overhead and 
efficiency is something Jess man that For systems that have even more demanding I/O recrements, 
roulh'pte zippers and MCA buses can be utilized These techniques are seen to be important to processors 
that would support a large external storage associated wilh nodes or clusters, as might be characteristic of 
database machines. Such I/O p/owtf? capability Is completely unique to this machine and has not previously 
'5 been mcorporated in either massively parallel, conventional single processor, or coarse-grained parallel 
machines. 

Array Director Architecture 

ao Our massively parallel system is made up of nodal building blocks of multiprocessor nodes, dusters of 
nodes, and arrays of PME 3 already packaged in clusters. For control of these packaged systems wa 
provide a system array director which with the hardware controllers performs the overall Processing 
Memory Element (PME) Array Controller Junctions in me massively parallel processing environment. The 
Director comprises of three functions! areas, the Application Interface. Che Cluster Synchroniser, and 

29 normally a Cluster Controller. The Array Director wfll have the overall control ol the PME array, using the 
broadcast bus and our zrpper connection to steer data and commands to all of the PMEs. The Array 
Director functions as a software system interacting whft the hardware to perform the role as the shell of the 
operating system. Tne Array Director In performing this role receives commands from the application 
interface and issuing the appropriate array instructions ana hardware sequences to accomplish the 

ao designated task. The Array Director's main function Is to continuously feed the instructions to tlte PMEs and 
route data in optimal sequences to keep the traffic at a maximum and collisions to a minimum. 

The APAP computer system shown in FIGURE O is illustrated in more detail in the diagram of FIGURE 
12 which Illustrates the Array Director which can function as a controller, or array controller, as illustrated In 
FIGURE 13 and FIGURES 18 and 19. This Array Director 610 frustrated in FIGURE 12 is shown in the 

35 preferred embodiment of an APAP in a typical conffgu ration of n identical array clusters 665, 670, 680, 690, 
wUh an array director 6i0 for the dusters of 512 PMEs, and an application processor interface 030 for the 
application processor or processors 600. The synchronizer 650 provides the needed sequences to the array 
or cluster controller 640 and together thoy make up the "Array Director" 6I0, Tfie application processor 
Interface 630 wfll provide the support for the host processor 300 or processors and test/debug workstations. 

<*> For APAP unite attached to one or more hosts, the Array Director serves as the interface between the user 
and the array of PMEs- For APAPs functioning as stand alone parallel processing machines, the Array 
Director becomes the host unit and accordingly becomes involved in unit IX) activities, 

The Array Director will consist of the following four functional areas: (see the functional block diagram in 

FIGURE 12) 

45 1 . Application Processor interface (API) 600, 

2, Ouster Synchronizer (CS) 650 (0 x 8 array of clusters), 

3, Ouster Controller (0C> 640 {B x 1 array or nodes). 

4, Fast t'O (zipper Connection) 620. 

so The Application Processor Interface (API) 630: 

When operating in attached modes, one API will be used for each host That API will monitor the 
incoming data stream to determine what are instructions to the Array dusters 835. 670, 600, 680 and what 
are data for the Fast vO (zipper) 620. When in standalone mode, me API serves as the primary user 
59 program host. 

To support these various requirements, the APIs contain the only processors within the Array Director, 
plus the dedicated storage for the API program and commands. Instructions received from the host can call 
for execution of API subroutines, loading of API memory with additional functions, or for loading of CC and 
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PME memory with naw &V#. As described in ihe S"*V overview section, these various type requests can be 
restricted to subset oF users via (he Initial programs loaded Into (he API. Thu8> Ute operating program 
loaded will determine the type of support provided which can bs tailored to match the performance 
capability of th© API, This writer permits the APAP to be adjusted to the needs of multiple users requiring 
3 managed and well tested services, or to the Individual user wishing to obtain peak performance on a 
particular application. 

The API ateo provides for managing the path to and from the vQ zipper. Data received from the host 
system in attached modes, or from devices In standalone modes Is forwarded to the Array. Prior to initiating 
this type of operation the PMEe within the Array which will be managing the K> are Initiated. PMEs 

70 operating in MIMD mode can utilize the feel interrupt capability znd either standard SAV or special functions 
for this transfer while those operating in SlMD modes would hove to be provided detailed control 
instructions. Data being sent from the 1*0 zipper requires somewhat the opposite conditioning. PMEe 
operating in MIMD modes must signal Ihe API via the high speed serial interface and await a response from 
the API, while PMEs In SlMD modes are already In synchronization with the API and can therefore 

;s immediately output data. The ability to system switch between modes provides a unique ability to adjust the 
program to the application. 

Cluster Synchronizer (C3) 650 

20 Tbe CS 650 provides the bridge between the API 630 and CC 640. it stores API 630 output in FIFO 
stacks and monitors the status being returned from ihe CO 850 (both parallel input acknowledges and high 
speed serial bus data) to provide trie CC, to timely fashion, with the desired routines or operations that need 
to be steried. The C$ provides the capability to support different CCs and different PMEs within clusters so 
as lo permit dividing the array into subsets. This is done by partitioning the array and then commanding the 
involved duster controllers to selectively forward the desired operation. The primary functon of the 
synchronizer is to keep all clusters operating and organized such that overhead time is minimized or buried 
under the covers of PME execution time We have described how iha use of ihe duster synchronizer in A- 
SIMD configurations is especially desirable. * 

30 Cluster Corrtroaer (CC) 640 

Ths CC 640 interfaces to the node Broadcast and Control Interface (BCI) 606 for the set of nodes in en 
array cluster B35. (For a 4d modified hypercube with 8 nodes per ring trial means the CC 340 la attached to 
64 BCls 605 in an 8 by 8 array of nodes and Is controlling 5i2 PMEs. Sixty-four such clusters, also in a 8 

3$ by 8 array, lead to the full up system with 32768 PMEs.) The CC 640 will send commands and data 
supplied by fre C$ 660 to the BCl parallel port and return the ackrKwdedgement data to ihe C$ 650 when 
operating in MIMD modes, in siMO mode the interface operates synchronously, and step by step 
acknowledgments are not required. The CC 640 also manages and monitors the high speed seri2l port to 
determine when PMEs within the nodes are requesting services. $uch requests are passed upward to the 

40 CS 650 while the raw data from the high speed serial interiaos is made available to the etetua oisptey 
interface. The CC 640 provides the CS 660 w'rih an interface to specific nodes within the cluster via the 
standard speed serial Interface. 

In SlMD mode the CC win be directed to send Instructions or addresses to all the PMEs over ma 
broadcast bus. The CC can dispatch 10 bit instruction to all PMEs every 40 nanoseconds when in SlMD 

4$ mode. By broadcasting groups of native instructions to the PME, the emulated instruction set is formed. 

When in MiMD mode the CC will wait for the endop signal before issuing new instructions to the PMEs. 
The concept ol the MIMD mode is to build airings of micro-routines with native Instructions resident in the 
PME. These strings can be grouped together to form the emulated instructions, and ttese emulated 
instruction can be combined to produce service-canned routines or library functions. 

so When In SIMDVMIMD (SltVUMD) mode, the CC will Issue instruction as If In SlMD mode and check tor 
endop signals horn certain PMEs. The PMEs mat are in MIMD wdi not respond to the broadcast instructions 
and will continue with there designated operation. The unique status Indicators will help ihe CC to manage 
this operation and determine when and to whom lo present the sequent^ instructions. 

55 Operational Software Levels 

This application overviews the operational software $/W levels to provide further explanation of the 
services performed by various hardware H/W components. 
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Computer systems generally used have an operating system. Operating system kernels which are 
relatively complete must be provided in most massive MIMD machines, where workstation class CPU chips 
iun kernels such as Mech. The operating system kernal eupports message passing or memory coherency. 
Otter massively parallel systems based upon SIMD models have almost no intelligence in the array. There 
s are no "program counters" out In the array, and thus no programs to execute locally. All Instructions are 
broadcast. 

In the systems we have provided with our PME as the basis tor cluster arrays, there Is not need tor an 
operating system at each chip, a node. We provide a library of key functions for computation and/or 
communication within each PE <PME) that can be invoked at a high level. SiMCHlto InstnjcSons are 

jo broadcast to the array to set each of a selected set of PMEs- These PMEs can then perform in full MIMD 
mode one or more of these library routines. In addaon, basic imorupt handler and communications routines 
are resident In each PME allowing the PME to handle commutation on a dynamic basis. Unlike existing 
MIMD machines, (he APAP structure need nor include an entire program in PME memory. Instead all of lhal 
code, which is essentially serial, is the cluster controller. Thus such code, 90% by speoe and 10% by time 

>c (typically) cen be broadcast in a SIMD fashto to an array of PMEs. Only the bury parallel inner loops are 
distributed to Ihe PMEs In a dynamic fashion. These are then inflated Into MIMD mode just as other 
"library" routines are. This enables use of program models which are Single Program Multiple data to be 
used where the same program is loaded in each PME node, with embedded synchronization code, and 
executed at the local PME. Design parameters affect bandwidth available on different links, and the system 

20 paths are progiainmeticairy configurable, allowing high bandv/Hh links on a target network, and allowing 
dynamic partition of off chip like PME-to-PME links to provide more bandwidth on specific parhs as meets 
the needs of a particular application. The links leaving a chip mate directly with each other, without the need 
for external logic. There are sufficient links and there is no predesigned constraint as to which other tints 
they can attach to, so lhal the system can have a diversity of interconnection topologies, with routing 

29 performed dynamically and programmaiicaJly. 

The system allows usage of existing compilers and parser* to create an executable paraiel program 
which could run on a host or --workstation based configuration Sequential source code tor a Single Program 
Multiple Data system would pass through program analysis* for examination of dependency, data and 
controls, enabling extension ot program source to include can graphs, dependency tables, aliases, usage 

3D tables and the like. 

Therafter, program traitsformation would occur whereby a modified version of the program would be 
created that extends the degrees of parallelism by combining ssquences or recognizing patterns to 
generate explicit compiler directives. A next step would be a data allocation and partitioning stop, with 
message generation » which would analyze data usage patiemand allocate so that elements to be combined 

as would share common indexing, addressing pattern, and these would provide embedded program compiler 
dlrectfves and calls to ccmmunlcatlon services. At this point the program would pass to a level partitioning 
step, A level partitioning step would separate the piogram into portions for execution in ARRAY, n> ARRAY 
CONTROLLER (array director or cluster controller), and HOST, Array portions would be interleaved in 
sections wKh any required message passing synchronization functions. At this point level processing could 
proceed. Host sources would pass to a level compiler (FORTRAN) tor assembly compilation. Controller 
sources would pass to a microprocessor controller compiler, and items thai would be needed by a single 
PME and not available in a library cafi would pass to a parser (FORTRAN OR C) to an intermediate level 
language representation wnich would generate optimized PME code and Array Controller code. PME code 
would be created at PME machine level, and would include library extensions, which would pass on load 

45 trno a PME memory. During execution a PME parallel program in the §pmd process of execution could call 
upon already coded assembly service functions from a runtime iibrcry kernel. 

Since the APAP can function as either an attached unit that ie closely or loosely coupled with Its host or 
as a stand alone processor, some variation in the upper level S/W models exists. However, these variations 
serve to integrate the various type applications so as to permit a single set of lower level functions to satisfy 

so ail three applications. The explanation win address (he attached version &W first and then the modifications 
required tor standalone modes. 

si any system, as Illustrated by FIGURE f 0, where the APAP Is Intended to attach to a host processor 
the user's primary program would exist within ihe host and would delegate to the APAP unit tasks and 
associated data aa needed to provide desired load balancing, The choice of interpreting ttrc dispatched 

so task's program within the host or the Array Director is a user option. Host level interpretation permits the 
Array Director to work at interleaving users which do not exploit close control of the Array* while APAP 
Interpretation leads to minimal latency In control branching but tends to limit the APAP time to perform 
multi-user management lasks. This leads to the concept lhal the APAP and host can be tightly or loosely 
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coupled. 

Two examples illustrate the extremes: 

i. When APAP is atiached to 3090 class machines with Flowing Point Vector Facilities, user data in 
compressed form could be stored within the APAP, A host program the* called for a vector operation 

a upon two vectors with differing sparseness characteristics would then send instructions to Ihe APAP to 
realign the date into element by element matching pairs, output the result to the Vector Facility, read 
answer from the Vector Facility end finally reconfigure data into final sparse data tonm, Segments of the 
APAP would be interpreting and building sparse matrix bit maps, while other sections would be 
calculating how to move data between PMEs such that It would be property alined for the zipper 

10 Z. With APAP attached to a small inflight military computer, the APAP could be performing trie entire 
workload associated with Sensor Fusion Processing, The host might Initiate the process once, send 
sensor data as it was received to the APAP and then wart for results. The Array Director would then have 
to schedule and sequence the PME array through perhaps dozens of processing 6teps required to 
perform the process. 

/s The APAP will support three levels of user control: 

1. Casual User. Sme works with supplied routines and library Function. These routines are maintained at 
fte host or API level and can be evoked by the user via subroutine calls within his program. 

2. Customizer User. S/he can write special functions which operate wKhin tha API and which directly 
evoke routines supplied with the API or services supplied with the CC or PME. 

so 3. Development User. $rt>e generates programs for execution in the CC or PME. depending upon API 
services for program load and status feedback. 

Satisfying these tfiree user levels in either closely of loosely coupled systems leads to Ihe partitioning 
of HW control tasks, 

so API Software Tasks 

The application program interlace API contains 3AV services that can test Ihe leading words of data 
received and can determine whether that data should be Interpreted by the API, loaded to some storage 
within ihe Array Director or PME, or passed to the L-0 zipper. 

30 For data that is id be Interpreted, (he API determines the required operation and invokes the function. 
The most common type operation would call for the Array to perform some function which would be 
executed as a result of API writes to the OS (and indirectly to the CC). The actual data written to the CS CC 
would In general be constructed by the API operational routine based upon the parameters passed to the 
API from the host. Data sent to the CS/CC would in turn be forwarded to the PMEs via the node BCl. 

38 Data could be loaded to ©Ether API storage, CC storage, or PME memory. Further, data to be loaded to 
PME memory could be loaded via either the I/O ^pper or via the node BCL For data to be put into the API 
memory, the incoming bus would be read then written to storage. Data targeted to the CC memory would 
be similarly read and then be written to the CC memory. Finely, data for the PME memory (in this case 
normally new or additional MIMD programs* could be sent to all or selected PMEs via the C$/CC/Node 3Cl 

40 or to e subset of PMEs tor selective redistribution via the I/O zipper. 

When data is to be sent to ihe I/O zipper, it could be preceded by inline commands that permit the 
PME MIMD programs to determine Its ultimate target or, rt could be preceded by calls to the API service 
functions to perform either MIMD initiation or SIMD transmission. 

In addition to responding to requests tor service received via the host interface, the API program will 

45 respond to reouest from the PMEs* Such requests wfl) be generated on the high speed serial port and will 
be routed through the CC CS combination. Requests of this sort can result in the API program's directly 
servicing the PMEa or accessing the PMEs via the standard speed serial port to determine further qualifying 
data relative to the service request. 

so PME Software 

The software plan includes: 

o Generation of PME resident service routines (that is> 'an extended l$A f ) tor complex operations end 
i/O management 

as o Definition and development of controller executed subroutines that produce and peso control and 
parameter data to &e PMEs via the BCl bus. These subroutines: 
K cause a set of PMEs to do mathematical operations on distributed objects, 
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2. provide i/O data management and synchronization services for PME Array and System Bus 
Interactions. 

3. provide startup program load, program overlay and program task management far PMEs, 
o Development of data allocation support services lor host level programs, and 

s o Development ol a programming support system including assembler, simulator, and H/W monitor and 
debug workstation. 

Based upon studies of military sensor fusion, opHmkzatsorv image transformation, US Post Office optical 
character recognition and FBI fingerprint matching applications, we have concluded thai a parallel processor 
programmed with vector and array commands (like BLAS calls) would be effective. Hie underlying 
to programming model must match the PME array characteristics feasible with today's technology- Specifi- 
cally: 

o PMEs can be independent stored program processors, 
o The array can nave thousands of PMEs. and be suitable tor fine grained paralielism. 
o Inter-PME networks will have very high aggregate bandwidth and a small "logical diameter*, 
is o Bul« by network connected microprocessor MIMD standards, each PME Is memory Kmlled. 

Prior programming on MIMD parallel processors has used task dispatching methodology. Such 
approaches lead to each PME needing access to an portion of a large program. This characteristic, in 
combination wish the non-shared memory characteristic ol the HAW, would exhaust PME memory on any 
significant problem. We therefore target what we believe is a new programming model, called 
20 'asynchronous SIMD* (A-SIMD) type processing. In ihis connection see USSN 7S8.7Q8, filed November 27. 
1991 of P. Kogge, which is Incorporated herein. 

A-SIMD programming in our APAP design means thai a group of PMEs will be cfirected by commands 
broadcast to them as In 8IMD models. Tlte broadcast command wiir initiate execution of a MIMD function 
wrihin each PME, That execution can involve data dependent branching and addressing within PMEs, and 
re I/O based synchronization with either other PMEs or the BCI. 

Normally, PMEs will complete the processing and synchronize by reading the next command rrom the 

BCI. 

The A-SIMD approach includes both MIMD and 8IMD operating modes. Since the approach imposes no 
actual time limits on me command execution period, a PME operation that synchronizes on data transfers 

oo and executes indefinitely can be initiated. Such functions are very effective in data filtering, DSP, and 
systolic operations. (They can be ended by either BCI interrupts or by commands over the serial control 
buses.) $IMD operation results from any A-SIMD control stream that does not Include MIMD Mode 
Commands. Such a control stream can include any of the PMEs native instructions. These "instructions are 
routed directly lo the instruction decode logic of ihe PME, Eliminating to PME instruction fetch provides a 

33 higher performance mode for tasks that do not involve data dependent branching. 

This programming model (supported by H/W features) extends to permitting to array of PMEs to be 
divided Into Independent sections. A separate A-SIMD command stream controls each section. Our 
application studies show that programs of interest divide into separate phases (ie. input input buffering, 
several processing steps, and output formatting, etc.), suitable for pipeline data processing. Fine-grained 

40 parallelism results from applying the n PMEs In a section to a program phase- Applying coarse-grained 
partitioning to applications often results in discovering small repetitive tasks suitable for MIMD or memory 
bandwidih limited tasks suitable for SIMD processing. We program the MIMD portions using conventional 
techniques and program the remaining phases as A-SIMD sections, coded with vectorized commands, 
sequenced by the array controller. This makes ihe large controller memory the program store. Varying ihe 

45 number of PMEs per section perrnrss balancing the workload. Varying the dlspstehed Task stee permits 
balancing the BCI bus bandwidth to the control requirements. 

The programming model also considers allocating data elements to PMEs. The approach is to distribute 
data elements evenly over PMEo, In early versions of 9/vV. this will be done by the programmer or by 3AV. 
we recognfee that the IBM parallelizing compiler technologies apply to this problem and we expect to 

st? investigate their usage. However, the irtfer-PME bandwidth provided does tend to reduce the importantly oi 
this approach. This links data allocation and I/O mechanism performance. 

The H/W requires that the PME initiate data transfers out of Its memory, and it supports a controlled 
wrfte into PME memory without PME program involvement Input control occur? in tie receiving PME by 
providing an input buffer address and a maximum length. When I/O to a PME results In buffer overflow, 

ss H/W win interrupt the receiving PME The low level I/O functions that will be developed tor PMEs burld on 
this service. Wo will support either movement of raw data botwoen adjacent PMEs or movement oi 
addressed data between any PMEs. The last capability depends upon the circuit switched and store and 
forward mechanisms. The interpret address and forward operation is important for performance. We have 
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optimized the H/VV and &W to support the operation. Using on© word buffers results In an interrupt upon 
receipt of address header. Comparing target Id with local td permits output path ©election* Tronofcr of the 
subsequent data words occurs in circuit switched mode. A slight varlatbn on trite process using larger 
buffers results in a store and forward mechanism. 

5 Because of the high performance inter-PME bandwidth, it is not always necessary or desirable to place 
data elements within the PME Array careMy. Consider shifnng a vector data element distributed across 
PMEs. Our architecture can send data without an address header, thus, providing for very fast I/O. However, 
we have found, in many applications, that optimizing a data structure for movement In one direction, 
penalizes flat* movement in en orthogonal direction. The penalty in such situations approximates me 

10 average cost of randomly routing data in the network. This leads to applications where placing daia 
sequentially or randomly (as opposed to arranging data) results in shorter average process times. 

Many applications can be synchronized to take advantage of average access time. (For example, POE 
relaxation processes acquire data from a neighborhood and thuB. can average access over at least four LO 
operations.) We believe that after considering the factors applicable to vector and array processes, like 

;5 scaiterfgether or row'column arithmetic, many users will find brute force data ol location to be suitable tor the 
application* However* we know of examples that Illustrate application characteristics (like required synchro- 
nization or biased utilization of shift directions'} that tend to force parScular data allocation patterns. This 
characteristic requires that the tools and techniques developed support either manual tuning of the data 
placement* or simple and non-optimum daia allocation. (We win support the non-optimum data allocation 

20 strategy with host level macros to provide near transparent port of vectorized host programs to the MPP. 
The HAAf Monitor workstation will permit the user to investigate the resuharrt performance.) 

FIGURE 19 shows the general S/W development and usage environment The Host Application 
Processor is optional in that program execution can be controlled from ehher the Host or the Monitor, 
Further, the Monitor will effectively replace the Array Controller is some situations. The environment will 

20 support program execution on real or simulated MPP hardware. The Monitor Is scenario driven so that the 
developer doing test and debug operation* can create procedures to permit effective operation et any level 
of abstraction. 

FIGURE 20 Blustrates the levels of H/W supported within the MPP end the user Interfaces to these 
levels, 

so We see two potential application programming techniques tor the MPP. In the least programmer 
intensive approach, applications would be written in a vectorized high order language. If the user did not 
feel that the problem warranted tuning data placement then he woukl use compile time services to allocate 
data to the PME Array. The application would use vector calls like ELAS thai would be passed to the 
controiier for interpretation and execution on the PME Array. Unique calls would be used to move data 

35 between host and PME Array. In summary, the user would not need to be aware of how the MPP organized 
or processed the data. Two optimisation technique* *HI be supported for this type application: 

1. Altering tl>e data allocation by cor^ructing me data allocation table will permit programs to force data 
placements. 

2. Oeoeration of additional vector commends for execution by the array controiler will permit tuned 
*o subnotions {fe. calling the Gaussian Elimination as a single operation.) 

We also see that the processor can be applied to specialized applications as in those referenced in the 
beginning of mis section, in such cases* code tuned to the application would be used. However even in 
such applications the degree of tuning will depend upon how important a particular task is to the application. 
It is in this situation that we see the need for taste individually suited to SIMD, MIMD or A~81ftlD modes. 
4t Tnese programs will use a combination oft 

1. Sequences of PME native instructions passed to an emulator function within the array controller. The 
emulator will broadcast the instruction and Its' parameters to the PME set The PMEa in this SIMD mode 
win pass the instruction to fre decode function, simulating a memory fetch operation. 

2. Tight inner loop© that can be I/O syrKhronized will use PME native ISA prograrns. After initiation from 
so a SIMD mode change, they would run continuously In MIMD mode. (The option to return to SIMD mods 

via a 'RETURN' Instruction exists.) 

3. More compeared programs, as would be written in a vectorizing command set, would execute 
subroutines in the array controller that invoked PME native functions. For example a simplified array 
controller program to do a BLAS 'SAXPY' command on vectors loaded sequentially across PMEs would 

• Gaussian Elimination wlih normal pivoting requires shifting rows bui not columns, More than fct performance 
difference would result* from arranging rhe data such thot column* ware on the fast shift direction. Even wirh 
thai thsre is not an advantage to arranging rows in any particular relationship to the Puses. 
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start sequencas within the PMEs that: 

al Enable PMEb with required k elements via comparison of PME Id wlih broadcast Incx' and 
^eoW vaiues, 

b. Compress the x values via a writo to consecutive PMEs, 
s c. Calculate the address of PMEs wlih y elements from broadcast data 

d. Transmit the compressed x data io the y PMEs, 

e. Do a single precision floating point operation in PMEs receiving x values to complete the operation. 
Finally* the SAXPY example Illustrates one additional aspect of executing vectorteed application 

programs. The major steps execute in the API and could be programmed by either an optimizer or product 
;o developer. Normally, the vectorised application would call rather than include this levei o code. These steps 
would be written as C or Fortran code and wBI use memory mapped read or writes to control the PME array 
via the BCI bus. Such a program operates the PME array as a series of MIMD steps synchronized by 
returns to the API program , Minor steps such as the single precision floating point routines would be 
developed by tte Customer or Product Developer. These operations will be coded using the native PME 
75 ISA and will be tuned to the machine characteristics. In general, this would be the domain or the Product 
Developer since coding, teat and optimization at this level require usage of the complete product 
development tool set 

The APAP can have applications written In sequential Fortran. The path is quite dnferent, FIOURE 21 
outlines a Fortran compiler which can be used. In the first step. It uses a portion of the existing parallelizing 
to compiler to develop program dependencies. The source plus these tables become an input to a process 
that usee a characterfzauon of the APAP MMP and the source to enhance parallelism. 

This MMP Is a non-shared memory machine and as such allocates data between the PMEs for local 
and global memory. The very fast data transfer time* and the high network bandwidth reduce the time 
affect ot data allocation, but it still is addressed. Our approach treats part of memory as global and uses a 
29 $ArV service function. It Is also possible to use the dependency information to perform the data allocation In 
a second atemaSve. The final step in converting the source to mutHpte sequential programs is performed 
by the Level Partitioning step This partitioning step is analogous to the t Fortran sup Otef work being 
conducted with DARPA funding. The lest process In the compilation Is generation of the executable code at 
all individual functional levels. For the PME this wilt be done by programming the code generator on an 
existing compiler system. The Host and API code compilers generate the code targeted to those machines. 

The PME can execute MIMD software from its own memory. In general, the mufnple PMEs would no? 
be executing totally different programs but rather would be executing the same small program in an 
asynchronous manner. Three basic types of S/W can be considered alihough the design approach does not 
limit the APAP to just these approaches: 
as 1. Specialized emuta&on functions would make fte PME Array emulate the set of services provide by 
standard user libraries like UNPACK or VP$$. In such an emulation package, the p me Array could be 
using its multiple set of devices to perforin one of tie operations required in a normal vector call. This 
typo ot emulation, when attached to a vector processing unit, oould utilise iho vector unit for some 
operations while performing others internally. 
•** 2. The parallelism of the PME Array could be exploited by operating a art of software that provides a 
new set of mathematical and service functions in the PMEs. This set of primitives would be the codes 
exploited by a customizing user to construct his application. The prior example of performing sensor 
fusion on a APAP attached to a military platform would use such an approach. The customizer would 
write routines to perform Kalman Filters. Track Optimum Assignment and Threat Assessment using the 
45 supplied set of function names. This application woidd be a series of API call statements, and each call 
would result in initiating the PME set to perform some basic operation tike 'matrix multiply* on data 
stored within the PME Array. 

3. In cases where no effective method, considering performance objectives, o» application needs exists 
then custom &W could be developed 2nd executed wrthln the PME. A specific example is 'Sort*. Many 

so methods to sort data exist and the objective in all cases is to tune the process and the program to the 
machine architecture. The modified hypercube is well suited to a Batcher Sort; however, that son 
requires extensive calculations to determine particular elements to compare versus very short compari- 
son cycles. The computer program in FIGURE 17 shows a simple example of a PME program iiQQ to 
perform the Batcher Sort 1000 with one element per PME. Each line of the program description would be 

so expanded to 3 to 6 PME machine fevsf instructions, and all PMEs would (hen execute the program in 
MIMD mode. Program synchronization Is managed via the I/O statements. The program extends to more 
data elements per PME and to very large parallel sorts In a quite straight forward manner. 
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CC Storage Contents 

Data from the CC storage is used by the PME Array in one of two manners* When the PMEs are 
operating in 8IMD, 2 series of instructions can bo fetched by the CC and passed to the nodo BCI, thus, 
6 reducing load on both the API and CS. Alternatively, functions that are not frequently required, such as PME 
Fault ReoonflguraHon SrW, PME Diagnostics, aixJ perhaps conversion routines can be stored in the CC 
memory. Such functions can then be requested by operating PME MIMD programs or moved to the PMEs 
at the request of API program directives 

io Packaging of the 8-Way Modified Hypercube 

Our packaging techniques lake advantage of the eight PMEs packaged in a single chip and arranged in 
a N-cfimansiona! modified hypercube configuration* This chip level package or node of the array is the 
smallest building block in the apap design. These nodes are #eo packaged in an 8 x & errey where the 

/s X and the +■¥ makes rings within the array or cluster and the + -W, and «*-Z are brought out to ths 
neighboring clusters. A grouping of clusters make up an array. This step significantly cuts down wire count 
tor data and control tor the array. The W and Z buses will connect to the adjacent dusters and form W and 
2 rings to provide total connectivity around the completed array of various size. The massively parallel 
syslem vtffl be comprised or ihese duster building blocks to form the massive array of PMEs. The APAP 

20 wBI consist of an 8 X 8 array of clusters, each cluster will have its own controller and ell the cantroSeis will 
be synchronized by our Array Director. 

Many trade-offs of wlreebUity and topology have been considered, yet with these considerations we 
prefer the configuration which we illustrate with this connection. The concept disclosed has the advantage of 
keeping the X and Y dimensions within a cluster tevel of packaging, and distributing the W and 2 bus 

20 connections & all the neighboring clusters. 

After implementing the techniques dsscribedL me product will be wtoabte, and manufacturable white 
maintaining the inherent characteristics of the topology defined. 

The concept used here is to mix, match, and modify topologies at different packaging levels to obtain 
the desired results in terms of wire count. For the method ft define the actual degree of modification of the 

30 hypercube, refer to the Rolfe modified hypercube patent application referenced above. For the purpose ot 
this preferred embodiment we will describe iwo packaging levels to simplify our description, ft can be 
expanded, 

The first Is the chip design or chip package Illustrated by FIGURE 3 and FIGURE 11 . There are eight of 
the processing elements with their associated memory and communscetion logic encompassed into a single 

35 chip which is defined as a node. The internal configuration is classified as a binary hypercube or a 2-degree 
hypercube where every PME is connected to two neighbors See the PME-pme communication diagram in 
FIGURE a especially 500, 510, 520. 530. 5*0. 550, 560, 570, 

The second step is that the nodes are configured as an & X 8 array to make up a cluster. The folly 
populated machine Is buM up of an array of 8 X & clusters to provide the maximum capacity of 32738 

40 PMEs. These 4098 nodes are connected in an 8 degree modified hypercube network where fte commu- 
nication bstween nodes is programmable. This ability to program different routing paths adds flexibUUy to 
transmit different length messages, m addition to message length differences, there are algorithm optimis- 
ations that can be addressed with these programmable features. 

The packaging concept is intended to significantly reduce Ihe off page wire count for each of the 

<s clusters. This concept takes a cluster which Is defined as a 8 x 8 array of nodes 820. each node 825 having 
8 processing elements for a total of 512 PMEs, then to limit the X and Y ring within the cluster and, finally, 
to bring out the W and Z buses to all clusters. The physical picture could be envisioned as a sphere 
configuration 800, 810 of 84 smaller spheres 830. See FIGURE 15 lor a future packaging picture which 
illustrates the lull up packaging technique, limiting the X and Y rings 800 within the duster and extending 

so out the W and Z buses to all clusters S10. The physical picture could be envisioned as a sphere 
configuration of 64 smaller spheres ©0. 

The actual connection of a single node to the adjacent X and Y neighbors 975 exists within the same 
cluster. The wiring savings occurs when the Z and W buses are extended to the adjacent neighboring 
clusters as illustrated In FIGURE 16. Also illustrated in FIGURE 1 6 is me m of the chips or nodes that can 

5s be configured as a sparsely connected ^dimensional hypercube or torus d00, 905, 910, 915. Consider each 
of the 8 external porta to be labeled as +X, +y« +Z, + W, -X, -Y, -z> -w 950. 975. Then, using m chips* e 
ring can be constructed by connecting the +x to -X ports. Again m such rings can be interconnected into a 
ring of rings by interconnecting the matching + Y to -Y ports. This levef of structure will be called a cluster. 
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It provides for 51 2 PMEs and will be the building block for several s«e systems. Two such connections 
(950, 975) are shown In Urn diagram illustrated in FIGURE t8. 

Applications for Deskside MPP. 

s 

Hid deskside MPP in a workstation can be effectively applied in several application areas including: 

1. Small production tasks that depend upon compute intensive processes. The US Postal Service 
requires a processor that can accept a tax Image of a machine printed envelope and then find and read 
the zip code. The process la needed at all regional sort facilities and Is an example ol e very repetitive 

io but sift compute intensive process. We have implemented API language versions of a sampia of the 
required programs. These models emulate the vector and array processes that will be used to do the 
work on in* MPP. Based upon this test, we know that the task is an excellent match to fae processing 
architecture. 

2. Tasks in which an analyst, as e result of prior output, or expected needs requests sequences of data 
? 5 transformations. In an example taken from the Defense Mapping Agency, satellite images ore to be 

transformed and smoothed pixel by pixel Into some other coordinate system. In such a. situation, the 
transformation parameters for the image vary across localities as a result of ground elevation and elope. 
The analyst must therefore add fixed control points and reprocess transformations. A similar need occurs 
in the utilization of scientific simulation results when users require almost real time rotation or perspeo 
20 tive changes. 

3. Program development for production versions of the MPP will use workstation size MPPs. Consider a 
tuning process that requires analysis ol processor versus network performance. Such a task Is machine 
and analyst interactive, ft can require hours when the machine is idle and me analyst is working. When 
performed on a supercomputer \i would be very costly. However, providing an affordable workstation 

29 MPP with the same (but scaled) characteristics as the supercomputer MPP eliminates coats and eases 
the test and debug process by eliminating the programmer ineftictenctes related to accessing remote 
processors. 

FIGURE 22 ts a drawing of the workstation accelerator. It uses the same sl2e enclosure as the 
RiSC<6000 model 530. Two swing out gates, each containing a tap cluster ere shown. The combined two 

30 clusters provide 5 GOPS of fixed point performance and S30 MftopS of processing power and about 100 
Mbyte/s of I/O band*id3i to the array. The unit would be suitable tor any of the prior applications. With 
quantity production and including a host RI8C.-'6000, it would be price comparable with high performance 
workstations, not al the price of comparable machines employing old technology. 

as Description of the AvVACS Sensor Fusion 

The military environment provides a series of examples stowing the need for e hardened compute 
intensive processor. 

Communication In the targeted noisy environments Implies the need tor digitally encoded cornmunica- 

40 lions, as ie used in ICNIA systems. The process ef encoding the data tor trensmjasion and recovering 
information after receipt is a compute intensive process. The task can be done with specialized signal 
processing modules, but tor situations where communication encoding represents bursts of activity, 
specialized modules are mostly idle. Using the MPP permits several such tasks to be allocated to a single 
module and saves weight power, volume and cost. 

<$ Sensor data fusion presents a particularly clear example of enhancing an existing p&form with the 
compute power gained from the addition of MPP, On fte Air Force E3 AWAC9 there are more than four 
sensors on the platform, but there le currenUy no way to generate tracks resulting from the Integration of all 
available data. Further, the existing generated tracks have quite poor quality due to sampling characteristics. 
Therefore, there is motivation to use fusion to provide an effective higher sample rate. 

so We have studied this sensor fusion problem in detail and can propose a verifiable and effective solution, 
but that solution would overwhelm the compute power available in an AWACS data processor, figure 23 
shows the traditional track fusion process. The process is faulty because each of the Individual processes 
tends to make some errors and the final merge tends to collect them Instead of eliminating them. The 
process Is also ciwractorized by high time latency in that merging does not complete until the slowest 

55 sensor completes. FIG U RG 24 presents an improvement and the resulting compute intensive problem with 
the approach. Although we cannot solve a NP-Herd problem, we have developed a good method to 
approximate m solution. While the details of that application are being described by the inventors 
elsewhere, as it can be employed on a variety of machines Ifte an Intel Touchstone with 512 i860 <80S60) 
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processors, and IBM's Scientific Visualization System, it can be used as an application suitable for the MMP 
using the APAP design described here with ©ay 128.000 PMEs, substantially outperforming these other 
systems. Application experiments show the approximation quality is below the level of sensor noise end as 
such the answer is applicable to applications like AWAC8- FIGURE 26 shows the processing loop involved 

s In ihe proposed Lagrangean Reduction n -dimensional Assignment algorithm. The problem uses very 
controlled repetitions of the well known 2-tfmen$k>n&! assignment problem, the same algorithm tiwrt 
classical sensor fusion processing uses. 

Suppose for example that the n-dlmensiooal algorithm was to be applied to the seven sets of 
observations illustrated m FIGURE 24 and further, suppose thai each peas through a reduction process 

io required four iterations ihrough a 2d Assignment process. Then the new fid Assignment Problem would 
require 4000 iterations of tne 2d Assignment Problem. TT>e awac$' workload is now about 00% of machine 
capacny. Fusion perhaps requires 10% of the total effort, but even that small effort when seated up 4000 
times results in total utilization begng 370 times ihe capacity of an AWAC3. Not only doss this workload 
overwhelm the existing processor, but it would be marginal In any new MIL environment suited* coarse- 

;s grained, parallel processing system currently existing or anticipated in the next few years. If the algorithm 
required an average of 5 rather than 4 iterations per step* then ft would overwhelm even the hypothesised 
systems. Conversely > the WPP solution can provide the compute power and can do so even at ihe 6 
iteration level. 

r» Mechanical Packaging 

As Illustrated in FIGURE 3. and other FIGURES, our preferred chip is configured in a quadflaipack form. 
As such ft can be brickwalled into into various 2 O and 3 O configurations in a package. One chip of eight or 
more processor memory elements is a first level package module, the same as e single DRAM memory 

so chip is to a foundry which packages the chip. However, It Is In a quadfletpack form, allowing connections to 
one another in four directions. Each connection is point to point. (One chip in its first level package is a 
module to the foundry.) We are abfe to construct PE arrays of sufficient magnitude to hit our performance 
goals due to this feature. The realty Is <hat you can connect these chips across 3, 4 or even five feet point- 
to-point, 1.6. multi-processor node to node, and sq'P have proper control without the need of fiber epics. 

so This has an advantage for the drive/receive circuits that are required on the modules. One can achieve 
high performance and keep the power dissipation down because we do not have bus systems that daisy 
chain from module to module. We broadcast from node to node, but this need not be a Mgh performance 
path. Most data operations can be conducted In a node, so data path requirements are red need. Our 
broadcast path is essentially primsrdy used as a controller routing tool. The data stream attaches to and 

as runs lix the ZWXY communication path system. 

Our power dissipation is 22 watts per node module for our commercial workstation. This allows us to 
use air cooled packaging. The power system requirements for our system are also reasonable because of 
this fact. Our power system illustrated multiplies (he number of modules supported by about 2 5 watts per 
module, end such a five volt power supply is very cost effective Those concerned with the amount of 

to electricity consumed would be astonished that 32 microoomputdrs could operate with less than the wattage 
consumed by a reading tight 

Our thermal design is enhanced because of the packaging. We avoid hot spots due to high dissipating 
parts mixed wltn low dissipating ones. This reflects directly on the cost of the assemblies. 

The cost of our system is very attractive compared to the approaches that put a superscalar processor 

45 on a card Our performance level pet assembly per wau per connector per part type per dollar Is excellent 
Furthermore, we do not need the earn* number of packaging levels that the other technology does. We 
do not need module/card/bscfcpSane and oable. We oan skip the card level if we want to. As Illustrated in our 
workstation modules, we have skipped the card level wiih our brickwaJJed approach. 

Furthermore, as we illustrated in our layout, each node housing which is brickwalled m the workstation 

so modules, can as illustrated In FIGURE 3 comprise multiple replicated dies, even within (he same chip 
housing. While normally we would place one die wfthln an air cooled package, n is possible to place 8 die 
on a substrate using a multiple chip module approach- Thus, the envisioned warah with 32 or more 
processors* Is possible, as well as many other applications. The packaging and power and flexibility make 
applications which are enotess. A house could have Its controllable instruments an watched, and coendi- 

ss nated with a very small pari Those many chips thai ore spread around an automobile for engine watching, 
brake adjustment and so on ooufd all have a rnonltor within a housing, in addition, one the same substrate 
with hybrid technology, one could mount o 308 microprocessor chip with full programmable capability and 
memory (all in one chip) and use it as the array controller for (he substrate package. 
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We have shown many configurations of systems, from control systems, FIGURE 3. to larger and larger 
systems. The ability to package a chip with multiple processor memory element of eight or more on a chip 
in a (lip, with pinouts fitting in a standard DRAM memory module, such $3 in a SIM module make possible 
countless additional applications ranging from controls to wall size video displays which can have a 

s repetition rate, not a the 15 or so frames that press the existing technology today, but at 30 frames, witti a 
processor assigned to monitor a pixel, or a node only a few pixels. Our bridewell quadriatpacfc makes it easy 
to replicate the same part time over and over again. Furthermore, the replicated processor is really memory 
wfch processor interchange. Pari of the memory can be assigned to a specific monitoring task, and anc*ei 
part (with a size programmeticalry defined) can be a massive global memory, addressed poinHo-point, with 

jo broadcast to ail capability. 

Our basic workstation, our supercomputer our controller, our AWACS, en are examples of packages 
that can employ our new technology. An array of memory, with inbuilt CPU chips and MX functions as a 
PME of massively parallel applications, and even more limited applications. The flexibility of packaging and 
programming makes imaginations expand and our technology allows one pan to be assigned to many ideas 

'5 and images. 

Military Ayjorjcs Appf cations 

The cost advantage of constructing a MIL MPP Is particularly well illustrated by the AWACS. It Is a 20 
2$ year old enclosure that has grown empty space as new technology memory moduiee have replaced the 
original core memories. FIGURE 26 snows a MIL quaiifiable wo cluster system that would frt directly into 
the rack's emply space and would use the existing memory bus system for Interconnection. 

Anhough the AWACS example is very advantageous due to the existence of empty space, in other 
systems it is possible lo create space. Replacing existing memory with a smell MPP or gateway to an an 
29 isolated MPP is normally quite viable. In such cases, a quarter duster and a adapter module would result In 
a 8 Megabyte memory plus 640 MIPs and use perhaps two slots. 

Supercomputer Application 

a> A 64 cluster MPP Is a 13.6 Gflop supercomputer. It can be configured in a system described in FIGURE 
27. 

The system we describe allows node chips to be brick watted on cluster cards as illustrated in FIGURE 27 
to build up systems with some significant cost and size advantages. There Is no need lo Include extra chip a 
such as a network switch in such s system because it would increase costs* 

35 Our interconnection system with "brickwaJled" chips allows systems to be built like massive DRAM 
memory is packaged and will have a deOned bus adapter conforming to the rigid bus specification*, for 
instance a MicroChannel bus adaptor. Each system will have a smaller power supply system and cooling 
design than other systems based upon many modem microprocessors. 

Unlike most supercomputers our current preferred APAP with floating point emulation Is much faster In 

«> integer arithmetic (164 GIPS) than it is when doing floating point arithmetic. As such, the processor would 
be most effective when used in applications that are very character or integer intensive. We have 
considered three program challenges which in addition to the other applications discussed herein are 
needfol of solution. The applications which may be more important than some of ma "grand challenges" to 
day to day life include: 

1. 5090 Vector Processors contain a very high petformanoe floating point arithmetic unit- That unit, as do 
most vectorized floating point units, requires pipeline operations on dense vectors. Applications that 
make extensive use of non-regular sparse matrices tie. matrices described by bit maps rather than 
diagonals) waste the performance capability of the floating point unit The MPP solves this problem by 
providing the storage for the data and using its compute power and network bandwidth, not to do the 
so calculation but rather to construct dense vectors, and to decompress dense results. The Vector 
Processing Unrt is kept busy by a continual flow of operations on dense vectors being supplied to it by 
the MPP, By sizing ihe MPP so mat it can effectively compress and decompress at the same rate the 
Vector Facility processes, one could keep both units fuHy busy* 

2, Another host atached system we considered is a solution to the FBI fingerprint matching problem. 
53 l-lsre, a machine with more than 64- clusters was considered. The problem was to match about 8000 
fingerprints per hour against the entfre database of fingerprint history. Using rnoosfre DA$D and the full 
bandwidth of the MPP to host attachment, one can roll the complete data base across the incoming 
prints in about 20 minutes. Operating about 75% of the MPP in a 8IMD mode coarse matching 
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operation, balances processing to required throughput rate. We estimate that 15% of the machine in A- 
SIMD processing mods kouM ihen complete the matching by doing the detailed verification of unknown 
print versus file print for cases passing the coarse filter operation. The remaining portions Of the machine 
were In MIMD mode and allocated to reserve capacity, work queue management and output formatting, 
a 3. Application of the MPP to database operations has been considered. Although the work is very 
preliminary, It does seem to be a good tnatch. Two aspects of the MPP support this premiss: 

a. The connection between a cluster Control Isr and the Application Processor Interface Is a Micro- 
Channel. As such, It could be populated with DASD dedicated to the cluster and accessed directly 
from the cluster, A 64 duster system with six 640 Mbyte hard drives attached per cluster would 

10 provide 246 Gbyte storage. Further, that entire database could be searched sequentially in 10 to 20 

seconds. 

b. Databases are generally not searched sequentially. Instead they use many levels of pointer?. 
Indexing of databases can be done wfchin the cluster. Each bank of DASD would be supported by 2.5 
GIPS of processing power and 32 Mbyte of storage. That is sufficient for both searching and storing 

J5 the indices. Since indices are now frequently stored within the DASD, significant performance gains 

would occur. Using such an approach and dispensing DASD on SCSI Interfaces attached to tha cluster 
MicroChannel permits effectively unlimited size data bases, 
FIGURE 27 is an illustration of the APAP when used to build the system into a supercomputer scaled 
MPP. The approach reverts to replicating units, but here It is enclosures containing IS clusters that are 
50 replicated. The particular advantage of this replication approach is that the system can be scaled to suit the 
user's needs. 

System Architecture 

2S An advantage of the system architecture which is employed in the current preferred embodiment Is the 
ISA system which will be understood by many who will form a pool for programming the APAP, The PME 
ISA consists of the following Data and Instruction Formats, illustrated in the Tables. 

Data Formats 

SO 

The basic (operand) size is the 16 bit word. In PME storage, operands are located on integral word 
boundaries. In addition to the word operand size, oflher operand sizes are available in multiples of 16 bits to 
support additional functions. 

Within any of the operand lengths, the bit positions of the operand are consecutively numbered from 
35 left to right starting! with the number 0. Reference to high-order or most-significant bite always refer to the 
left-most bit positions. Reference to the low-orxJer or least-slgnincam bits always refer to rjie right-most bit 
positions. 

Instruction Formats 

40 

The length of an instruction format may either be 16 bits or 32 bits, in PME storage, instructions must 
be located on a 16 bit boundary. 

The following general Instruction formats are used. Normally, tha ftrst four bits of an instruction define, 
the operation code and are referred to as Ihe OP bits. In some cases, additional bits are required to extend 
46 the definition of the operation or to define unique conditions which apply to the instruction* These bts are 
referred to as OPX bhs. 
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format Code 


Oparatioo 


RR 


Register to Register 


OA 


Direct Address 


R6 


Register Storage 


Rl 


Register Immediate 


SS 


Storage to Storage 


SPC 


Special 



All formats have one field tn common. This field and its interpretation is: 

Bits 0*3 Operation Code - TMs fteld> sometimes in conjunction with en operation code extension 
>5 field, defines the operation to be performed. 

Detailed figures of the individual formats along wHh Interpretations of their fields are provided In the 
following subsections. For some instructions* two formats may be combined to form variations on the 
instruction. These primarily involve the addressing mode for the instruction. As an exampfo a storage to 
storage Instruction may have a form which Involves direct addressing or register addressing. 

RR Format 

The Register-Register {RR} formal provides two general register addresses and is id bits m length as 
shown, 

25 



OP 


RA 


0 8 0 0 


RB 


\ \ I 


_l i 1 


1 , 1, 1 


i i i 



G 3 4 7 B 11 1 
1 2 5 



in addition to an Operation Code fMd> the RR formal contains: 

Bits 4-7 Register Address i - Tl>e RA field is used to specify which ot the 16 general registers is 

to be used as an operand and** destination. 
Bits 3*11 Zeros - Bit $ being a zero defines the format to be a RR or DA format and bits 9-11 
<*? equal to zero define flis operation to be a register to register operation <& special case of 

the Direct Address formal). 
Sts 12-15 Register Address 2 - The RB field is used to specify which of the 16 general registers Is 
to be used as an operand. 



48 DA Format 



The Direct Address <DA) format provides one general register address and one direct storage address 
as shown. 
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20 



29 



OP 



RA 

I | t 



DIR ADDR 



1 I I I I t 



3 4 



7 8 9 



In addition to an Operation Code fiefcj, me DA format contains: 

Bits 4-7 Register Address 1 - The RA field is used lo specify which of the 16 general rasters is 

to be used as an operand andtor destination. 
Bit 8 Zero ■ This bit being zero defines the operation to be a direct address operation or a 

register to register operation. 
Bits 9-16 Direct Storage Address - The Direct Storage Address field is used as an address into 

the level unique storage block or the common storage block. Bits 0-1 1 of the direct 

address 3eld must be iK>iv*eit> Id define the direct address form. 



RS Format 

The Raster Storage 
address. 



format provides one general register addresses and en indirect storage 
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OP 
Mil 


RA 

1 f 1 


1 


DEI 

1 L 


Re 

i i i 


9 3 


4 7 


8 9 1 
1 


1 i 

2 5 



h addition to an Operation Code field, tne RS format container 

Bits 4-7 Register Address 1 - The RA field is used to specify which of the 16 gemaral registers is 

to be used as en operand end/or destination. 
Bit 8 One - Thie bit being one define* the operation to be a register storage operation. 

Bits 9»t1 Register Data - These bits are considered a signed value which is used to modify the 

contents ot register specified by the RB field. 

Register Address 2 - The RB field Is used to specify which of the 16 general registers is 
id be used as an storage address for an operand. 



BUS 12-15 
Rl Format 



The Register-immediate <Ri) format provides one gei>erai register eddress and 16 bits of immediate 
cteta, The Rl format is $2 bite ot fengfo as shown: 
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OP 


RA 


1 


DEL 


0 0 6 G 


IMMEDIATE DATA - 


L 1 1 


..111 




LJ 


' I 1 


-Li U..I 1 > 1 1 



3 4 



70 



7 8 9 11 

1 2 



3 
1 



In addition to an Operation Code field, the Rl format contains: 

Bits 4*T Register Address 1 - The RA field is used to specify which of the ie general registers 1$ 
>5 to bo used as an operand and/or destination. 

Bit 8 One - litis bH being one defines the operation to be a register storage operation. 

Bis 9-11 Register Data - These bits ere considered a signed value which is used to modify the 
contents of the program counter. Normally .this field would have a value of one for the 
register Immediate formaL 

to Bas 12-18 Zeroes - The field being zero i* used to specify %$k the updated pron/am counter, which 
points to the immediate data field, is to be used es an storage address for an operand. 
Bits 16-31 immediate Data - THs field serves as a IS bit immediate data operand for Register 
Immediate instniciicns. 

29 $$ Format 

The Storage to Storage (SSj format provides two storage addresses, one explicit: and the second 
Implicit The Implied storage address is contained In General Register 1. Register 1 is modified during 
execution of the instruction. There are two forms of a SS instruction a direct address form and a storage 

30 address fcnm. 





OPX 






















(Direct 


OP 




0 


c 


0 


01 R ftDDR 


Address 




0 


V 


R 






Form) 




p 


F 


Y 








I I I 


1 








' M 1 M 





0 3 4 7 8 9 1 

5 





OPX 


























(Storage 


OP 




0 


c 


1 


DEL 


m 


Address 




0 


V 


R 








form) 




p 


F 


Y 










1 _ 


1 








1 1 


i i i 





0 3 4 7 8 9 11 1 

l ^ 5 
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In addition to an Operation Code field, the SS formal contains: 

Bits 4-7 Operation Extension Code - The OPX Held, together with the Operation Code. defines 
ihe operation to be performed. Bits 4-5 define the operation type such as ADD or 
SUBTRACT. Bits 5-7 control the carry, overflow, and how the condition code will be set. 
Bit3 s o Ignores overOow. bit 6 = 1 allows overflow. Bit 7 = 0 ignore the carry stat 
during the operation; bit 7 « 1 includes ihe carry stat during the operation. 

Bit 6 Zero - Defines the form to be a direct address form. One - Defines me form to be a 

storage address form. 

Bits 9-15 Direct Address (Direct Address Form) - The Direct Storage Address field (s used as 
an address into ihe level unique storage block or She common storage block. Gits 9-11 of 
the direct address fleid must be nonzero to define the direct address form. 

Bits 9-11 Register Delta (Storage Address Form) - These bits are considered a signed value 
whioh is used to modity the contents of register specified by the RB field. 

Bits 12*15 Register Address 2 (Storage Address Fgrm) - The RB field is used to specify which 
of the 1 8 general registers is to be used as a storage address tor en operand. 



SPC Format 1 

The Special (SPCf ) formal provides one general register storage operand address. 



so 



30 



OP 


OPX 


0 


LEN 


RB 


RR Form 


1 1 1 


i 1 1 




' t 


...I 1 t 





3 4 



7 8 



1 1 
1 2 



3 4 



7 8 



1 1 
1 Z 





OP 


OPX 


1 


LEN 


RB 


38 


2 1 1 


t 1 1 




I i 


.III 



1 

5 



RS Form 



40 



4t> 



30 



In addition to an Operation Code field, the SPCt formal contains; 



Bits 4-7 
Bit 8 

Bite 9-11 



Bits 12-1* 



SPC Format 2 



OP Extension - The OPX field is used to extend the operation code. 

Zero or One - This bft being zero defines the operation to be a register operation. This 

bit being one defines the operation lo be a register storage operation. 

Operation Length - These bits are considered an unsigned value which Is used to 

specify the length of the operand in 16 bit words. A value of zero corresponds to a length 

of one, and a value of ET11 r corresponds to a length of eight 

Register Address a - The RB field is used to specify which of the 1 a oeneiei registers ie 

to be used as a storage address for the operand. 



The Special ($PC2> formal provides one genera? register storage operand address. 
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OP 

I 1 1 


RA 

i I.J 


OPX 
1 1 1 


RB 

f 1 1 


O 3 


4 7 


ft X 


1 1 



1 2 



(n addition io an Operation Code field, the SPC2 format contains: 

Bits 4-7 Register Address 1 - The RA ifeld is used to specify which of the 16 goners] registers is 
to be used as an operand end /or destination, 
is Bite S-11 OP Extension - The OPX field is used to extsnd iUe operation code. 

Bits 12*15 Register Address 2 - The RB field is used to specify which of the 16 genera! registers te 
to be used as a storage address for tte operand. 

THE INSTRUCTION LIST OF THE ISA INCLUDES THE FOLLOWING: 



Table 1 (Page 1 C f Z). Hued-Point Ariihmelic Induction* 
NAME 

ADD DIRECT 



MME- TYPME 
MONJC 

ada OA 
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Table 1 f^agfe 7 Of 3). Fixed-Painl Arithmetic Imtrvsiiona 





NAME 


MNE- 


TYP* 






MONIC 






ADD FROM STORAGE 


a 


RS 




(WITH DELTA) 


awd 


RS 




AOO IMMEDIATE 


ai 


Rl 


16 


(WITH DELTA) 


alwd 


Rl 




ADD REGISTER 


ar 


RR 




COMPARE DIRECT ADDRESS 


ccte 


DA 


;s 


COMPARE IMMEDIATE 


d 


Rl 




(WITH DELTA) 


ciwd 


Rl 




COMPARE FROM STORAGE 




RS 




(WITH DELTA) 


cwd 


RS 




COMPARE REGISTER 


w 


RR 




COPY 


CDV 


RS 


23 


{WITH DELTA) 


u yj y v» u 


i\ j 




COPY WITH BOTH IMMEDIATE 


r ovhi 


Rl 




{WITH DELTA) 


cpyhrwd 


Rl 


30 


COPY IMMEDIATE 






(WITH DELTA) 


cpytwd 


Rl 




COPY DIRECT 


cpyda 


DA 




COPY DIRECT IMMEDIATE 


cpydai 


DA 


35 


INCREMENT 


inc 


RS 




(WITH DELTA) 


incwd 


RS 




LOAD DIRECT 


Ida 


DA 


iO 


LOAD FROM STORAGE 


1 


RS 




(WITH DELTA) 


iwd 


RS 




LOAD IMMEDIATE 


ii 


Rl 


41 


(WITH DELTA) 


liwd 


Rl 




LOAD REGISTER 


ir 


RR 




MULTIPLY SIGNED 


mpy 


SPC 


20 


MULTIPLY SIGNED EXTENDED 


mpyx 


SPC 




MULTIPLY SIGNED EXTENDED IMMEDIATE 


mpyxi 


SPC 
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Tabi* 1 (Pag* 3 of 3}, Fixed-Pcini Arit'hm«w<: FnMivction* 
NAME 

5 

MULTIPLY SIGNED IMMEDIATE 
MULTIPLY UNSIGNED 
MULTIPLY UNSIGNED EXTENDED 
MULTIPLY UNSIGNED EXTENDED IMMEDIATE 
MULTIPLY UNSIGNED IMMEDIATE 
STORE DIRECT 
'= STORE 

(WITH DELTA) 

STORE IMMEDIATE 
20 (WITH DELTA) 

SUBTRACT DIRECT 

SUBTRACT FROM 370RAOE 
n JWJTH DELTA) 

SU8TRACT IMMEDIATE 
{WITH DELTA) 

SUBTRACT REGISTER 

SWAP AND EXCLUSIVE OR WITH STORAGE 



MNE- 
MONIC 

mpyi 

mpyu 

mpyux 

rnpyuxi 

mpyut 

stda 

st 

stwd 
sti 

stiwd 
sda 

5 

swd 
SE 

dtwd 
sr 

swapx 



TYPME 

SPC 

SPC 

SPC 

SPC 

SPC 

DA 

RS 

RS 

Rl 

Rl 

DA 

RS 

RS 

Rl 

Rl 

RR 

RR 



Table 7 (Page 1 ol 3). Storage to Storage Irtiirucliuns 

mm 

ADD STORAGE TO STORAGE 

(WITH DELTA) 
ADD STORAGE TO STORAGE DIRECT 
ADD STORAGE TO STORAGE FINAL 

(WITH DELTA] 
ADD STORAGE TO STORAGE FINAL DIRECT 
ADD STORAGE TO STORAGE INTERMEDIATE 

(WITH DELTA) 



MNE- 
MONIC 

sa 

aada 

saf 

saiwd 

satda 

sai 

sa'wd 



TYPME 

SS 
SS 
SS 
SS 
SS 
SS 
SS 
SS 
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Table 2 (Page 2 of 3). Storage jo Storage iftGlrtictiane 








NAME 


MNE- 


TYI 






MONIC 






ADD STORAGE TO STORAGE INTERMEDIATE 








DIRECT 


saida 


SS 




ADD STORAGE TO STORAGE LOGICAL 


sal 


ss 


10 


(WITH DELTA) 


salwd 


ss 




ADO STORAGE TO STORAGE LOGICAL DIRECT 




ss 




COMPARE STORAGE TO STORAGE 


sc 


ss 


JS 


(WITH DELTA) 


scwd 


ss 




COMPARE STORAGE TO STORAGE DIRECT 


scdd 


ss 




COMPARE STORAGF TO STORAGF FINAI 




SS 


so 


(WITH DELTA* 








COMPARE STORAGE TO STORAGE FINAL DIRECT 


scfd a 


SS 




COMPARE STORAGE TO STORAGE INTER MEDIATE 






5$ 


fWlTH DELTA* 




KS 




COMPARE STORAGE TO STORAGE INTERMEDIATE 








DIRECT 


scidd 


SS 




COMPARE STORAGE T 0 STORAGE LOGICAL 


scl 


ss 

WW 


3D 


{WITH DELTA) 


JblWVI 


ss 




COMPARE STORAGE TO STORAGE LOGICAL 








DIRECT 


set da 


ss 


38 


MOVE STORAGE TO STORAGE 


srnov 


ss 




{WITH DELTA) 


smovwd 


ss 




MOVE STORAGE TO STORAGE DIRECT 


srnovda 


ss 


tO 


SUBTRACT STORAGE TO STORAGE 


3S 


ss 




(WITH DELTA) 


S3Wd 


ss 




SUBTRACT STORAGE TO STORAGE DIRECT 


ssda 


ss 




SUBTRACT STORAGE TO STORAGE FINAL 




ss 




(WITH DELTA) 


ssfwd 


ss 




SUBTRACT STORAGE TO STORAGE FINAL DIRECT 


ssfda 


ss 


30 


SUBTRACT STORAGE TO STORAGE INTERMEDIATE 


ssi 


ss 


(WITH DELTA) 


ssiwd 


ss 
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JO 



Tabid 2 {Pag© 3 q( 3). Sioi 396 Storage instf actions 
NAME 

SUBTRACT STORAGE TO STORAGE INTERMEDIATE 
DIRECT 

SUBTRACT STORAGE TO STORAGE LOGICAL 

(WITH DELTA) 
SUBTRACT STORAGE TO STORAGE LOGICAL 

DIRECT 



MO NIC 

sal da 
ssl 

3s1wd 
sslda 



ss 

SS 

ss 
ss 



Table 3 



•20 



30 



i Logical tnsisucfilonfi 


NAME 


MNEMONIC 


TYPME 


AND DIRECT ADDRESS 


nda 


DA 


AND FROM STORAGE 


n 


RS 


(WITH OELTA) 


nwd 


RS 


AND IMMEDIATE 


nl 


Rl 


(WITH DELTA) 


niwd 


Rl 


AND REGISTER 


nr 


RR 


OR DIRECT ADDRESS 


oda 


DA 


OR FROM STORAGE 


0 


RS 


(WITH OELTA) 


owd 


RS 


OR IMMEDIATE 


ol 


Rl 


(WITH OELTA) 


oiwd 


Rl 


OR REGISTER 


or 


RR 


XOR DIRECT ADDRESS 


xda 


DA 


XOR FROM STORAGE 


X 


RS 


(WITH DELTA) 


xwd 


RS 


XOR IMMEDIATE 


xi 


Rl 


(WITH DELTA) 




Rl 


XOR REGISTER 


xr 


RR 
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Table 4 1 of 71 Shin Insimction* 

NAME MNE» TYPME 

WON1C 

SCALE BINARY scale spc 

SCALE BINARY IMMEDIATE scalel SPC 

SCALE BINARY REGISTER scaler SPC 

" SCALE HEXADECIMAL sca i eh SPC 

SCALE HEXADECIMAL IMMEDIATE scalehl SPC 



/S 


SCALE HEXADECIMAL REGISTER 


scatehr 


SPC 




SHIFT LEFT ARITHMETIC BINARY 


sla 


SPC 




SHIFT LEFT ARITHMETIC BINARY IMMEDIATE 


slai 


SPC 


Of) 


SHIFT LEFT ARITHMETIC BINARY REGISTER 


slar 


SPC 




SHIFT LEFT ARITHMETIC HEXADECIMAL 


slah 


SPC 




snlrT LEFT ARITHMETIC HEXADEulMAL IMMEDIATE 


siahj 


SPC 


20 


SHIFT LEFT ARITHMETIC HEXADECFMAL REGISTER 


sJahr 


SPC 




SHIFT LEFT LOGJCAL BINARY 


si! 


SPC 




SHIFT LEFT LOGICAL BINARY IMMEDIATE 


sill 


SPC 


30 


SHIFT LEFT LOGICAL BINARY REGISTER 


slJr 


SPC 




SHIFT LEFT LOGICAL HEXADECIMAL 


silh 


SPC 




SHIFT LEFT LOGICAL HEXADECIMAL IMMEDIATE 


sllhi 


SPC 


35 


SHIFT LEFT LOGICAL HEXADECIMAL REGISTER 


sJlhr 


SPC 




SHIFT RIGHT ARITHMETIC BINARY 


sra 


SPC 




SHIFT RIGHT ARITHMETIC BINARY IMMEDIATE 


srai 


SPC 




SHIFT RIGHT ARITHMETIC BINARY REGISTER 


srar 


SPC 


40 


SHIFT RIOHT ARITHMETIC HEXADECIMAL 


srah 


SPC 




SHIFT RIGHT ARITHMETIC HEXADECIMAL IMME- 


srahl 


SPC 




DIATE 








SHIFT RIGHT ARITHMETIC HEXADECIMAL REGISTER 


srahr 


SPC 




SHIFT RIGHT LOGICAL 8INARY 


srl 


SPC 


30 


SHIFT RIGHT LOGICAL BINARY IMMEDIATE 


srli 


SPC 
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Teble 4 {Patje ? of 2). Shift ln«|rMC<?0n* 








NAME 


MNE- 


TYPME 






MONIC 




s 


SHIFT SLIGHT 1 nfUPAl RINAKY RFf^l<5TFR 


<zr\r 

<M II 






SHIFT RIGHT LOGICAL HEXADECIMAL 


srlh 






wini~l f\|\jni LUai^ML nC.A.SUtV*»»PVl/AL J JV1 Wl CLH rU t 


srlhi 




70 


onirl KlunJ LvJoK~ML UtAAUtV-iIVlAL KfcoJolfcf* 


srlhr 


■O or* 


*5 


Table S (Page 1 of 2). Branch Instructions 








NAME 


MNE- 


TYPME 






MONIC 




20 


BRANCH 


b 


RS 




(WITH DELTA) 




RS 




BRANCH DIRECT 


bda 


DA 


25 


BRANCH IMMEDIATE 


bi 


Rl 




(WITH DELTA) 


biwd 


Rl 




BRANCH REGISTER 


br 


RS 


30 


BRANCH AND LINK 


bal 


RS 


BRANCH AND LINK DIRECT 


baida 


DA 




BRANCH AND LINK IMMEDIATE 


bail 


Rl 




(WITH DELTA) 


balitod 


Rl 


35 


BRANCH AND LINK REGISTER 


balr 


RS 




BRANCH BACKWARD 


bb 


RS 




(WITH DELTA) 


bbvvd 


RS 


40 


BRANCH BACKWARD DIRECT 


bbda 


DA 




BRANCH BACKWARD IMMEDIATE 


bbl 


Rl 




(WITH DELTA) 


bblwd 


Rl 




BRANCH BACKWARD REGISTER 


bbr 


RS 




BRANCH FORWARD 


bf 


RS 




(WITH DELTA) 


blwd 


RS 


50 


BRANCH FORWARD DIRECT 


bfda 


DA 
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JO 



00 



re 



Table 5 (f>2ge 2 of 2). Branch tnaiructioos 
NAME 

BRANCH FORWARD IMMEDIATE 

(WITH DELTA) 
BRANCH FORWARD REGISTER 
BRANCH ON CONDITION 

(WITH DELTA) 
BRANCH ON CONDITION DIRECT 
BRANCH ON CONDITION IMMEDIATE 

(WITH DELTA) 
BRANCH ON CONDITION REGISTER 
BRANCH RELATIVE 

(WITH DELTA) 
NULL CPMERATION 



MNE- 
MONIC 
bfi 

bfiwd 

bfr 

be 

bewd 
beda 
bci 

bdwd 

bcr 

brel 

brelwd 

noop 



TYPME 

Rl 

Rl 

RS 

RS 

RS 

RS 

Rf 

Rl 

RS 

Rl 

RS 

RR 



Tabled 



30 



Stesus Switching hstiuctions 


NAME 


MNEMONIC 


TVPME 


RETURN 


ret 


5PO 



0$ 



Tabid 7 



InpuVOutput tosiruetbns 


NAME 


MNEMONIC 


TYPME 


IN 


IN 


SPC 


OUT 


OUT 


SPC 


INTERNAL DIGFVDIOW 


INTR 


SPC 



50 
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SOME SUMMARY FEATURES 

The APAP Machine In Perspective 

We have described in accordance with our invention could be thought of In its more detailed aspects to 
be positioned in the technology somewhere between the CM-i and N-cube- Uke our APAP, the CM-1 uses 
a point design for the processing element and combines processing elements with mismory on the basic 
chip. The however uses a 1 bit wide serial processor .while iha APAP series will use a 10 bit wide 
processor. The CM series of machines started woh 4K bite of memory pei processoi and lies Grown to 8 or 
18K brte voreue tho 32K by 1$ bits to have provided tor the first APAP ohip. The CM-1 and Hs foUowons 
ere 9trictfy SlMD machines while (lie CMS Is a hybrid. Instead of this, our APAP will effectively' use MIMD 
operating modes in conjunction with 8IMD modes when useful. While our parallel 16 bit wide PME$ might 
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k» vtewed as a step toward the N-cube, this stop is not warranted. The APAP does not separate memory 
and routing from the processing element as does the N-cube kind of machine. A! go, the APAP provides Tor 
up to 32K 18 bit PME$ while the N-cube only provides for 4K 32 bit processor 

Even with the superficial similarities presented above, the APAP concept completely differs from the 

5 CM end N-cube series by: 

1. The modified hypercube incorporated in our APAP is a new invention providing a significant packaging 
and addressing advantage whan compared with hypercube topologies. For instance, consider that the 
32K PME APAP In its first preferred embodiment has a network diameter of 19 logical steps end. with 
transparency, this can be reduced to en effective 16 logical steps. Further, by oornpartson, i? a pure 

70 hypercube were used, and if all PMEs were sending data through an 8 step path, then on average 2 of 
every 8 pmes would be active while the remainder would be delayed due to blockage. 

Alternately, consider the G4K hypercube that would oe needed it CM-1 was a pure hypercube. In 
that case, each PME would require porls to 16 other PMEs, and data could be routed between the two 
farthest separated PMEs in 15 logical steps. If all PMEs tried to transfer an average distance of 7 steps, 

>5 the 2 of every 7 would be active. However, CM-1 doss not utilize a I6d hypercube. ti interconnects the 
16 nodes on a chip with a NEWS network: then it provides one router function within the chip. The 4096 
routers are connected into a 1 2d hypercube Wnh no collisions the hybrid sHR has a logical diameter of 
15, but since 16 PMEs could be contending for the link He effective diameter Is much greater. That is, 
with 8 step moves only 2 of 16 PMEs could be active, which means that 8 complete cycles rather than A 

to cycles are needed to complete eU date moves. 

The N-cube actually utilizes a pure hypercube, but currently onh/ provides for a 4096 PMEs and 
thus* utilizes a i2d (13d for 8192 PMEs) hypercube. For the N-cube to grow to 16K processors, at which 
point it would have the same processing data width as the APAP, it would have to add four limes as 
much hardware and would have to increase the connection porta to each PME router by 25%. Although 

29 no hard date exists to support this conclusion, ft would appear that the N-cube architecture runs out of 
connector pins prior to reaching a 16K PME machine. 

2. The completely Integrated and distributed nature of major tasks within the APAP machine is a decided 
advantage* As was noted for the CM and N-cube series of machines, each had to have separate units for 
message routing as well as separate units for floating point coprocessors. The APAP system combines 

30 the integer, floating point processing, message routing and I/O control Into the single point design PME. 
That design is then replicated 8 iimes on a chip, and the chip Is free replicated 4K times to produce Hie 
array. This provides several advantages: 

a. Using one chip means maximum size production runs and minima] system factor costs. 
b> Regular architecture produces the most effective programming systems, 
as c. Almost all chip pins cam be dedicated to the generic problem of irrterprocessor communication, 

me*lml2tno, the inter-chip I/O bandwidth which tends to be a Important limiting factor in MPP designs. 

3. The APAP has the unique design ability to take advantage or chip technology gains and capital 
investment in custom chip designs. 

Consider the question of floating point performance. It Is anticipated that APAP PME performance on 

6 OAXPY will be about 125 cyctee per flop, in contrast, the '387 Coproceseor would be about 14 cycles 
while ihe Wertec Coprocessor in the CM-1 would be about 6 cycles. However, in the CM case there is 
only one floating point unit for every 16 PMEs while in the N-cube case there is probably one *38? type 
chip associated wish eacn of the '38$ processors. Our APAP has 16 times as many PMEs and therefore 
can almost completely make up for the single unit performance delta. 

<*5 More significantly, the 8 APAP PMEs within a chip are constructed from sok gates currently 

available in the technology. As memory macros shrink and the number of gates avaHabte to the logic 
increases. Spending that Increase on enhanced floating point normalization should permit APAP floating 
point performance to far exceed the other units. Alternatively, effort could be spent to generate a PME or 
PME subsection design using custom design approaches, enhancing total performance while in no way 

so affecting any S/W developed lor the machine. 

Wa believe our des?gn for our APAP has characteristics poised to take advantage or the future 
process technology growth, in contrast, she nearest similar machines CM-x and N-cube which employ a 
system like that described in FIGURE 1 seem well poised to take advantage of yesterday's technology 
which we feel is dead ended, 
59 An advantage of the APAP concept is the ability to use OASO associated with groups of PMEs- This 
APAP capebiifey, as well as the ability to connect displays and euxfflaiy storage, is a by-product of ptcWng 
MC bus structures as me Interface to the external I/O ports of the PME Array, Thus. APAP systems will be 
configurable and can include card mounted hard drives selected from one of the set of units that are 
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compatible with PStf or RISC/8000 units. Further, that capability shoufd be available without designing any 
additional part number modules although it does require utilizing more replications or the bockpanel and 
bass enclosure than does the APAP. 

Thie brief perspective is not intended to be limiting, but rather is intended to cause those sMtied in the 

6 art to review the foregoing description and examine how the many Inventions we have described which may 
be used to move me art ot massively parallel systems ahead to a time when programming la no longer a 
significant problem and the costs of such systems are much lower. Our kind of system can be made 
available, not only to the few, but to many as It could be made at a cost within the reach of commercial 
department level procurements. 

10 While we have described our preferred embodiments of our invention, it will be understood that those 
skilled In the art both now and in the future, upon the understanding of these discussions will make various 
improvements and enhancements thereto which tall within the scope of the claims which follow. These 
claims should be construed to maintain the proper protection for the invention first disdOBed, 

ts Claim* 

1. A computer system comprising e plurality of muHi-prooessor memory elements, each having commu- 
nication paths, processor and memory, and wherein a programmable router is provided for routing data 
and control Information from one multi-processor memory element to another multi-processor memory 

so element end between nodes of the computer system. 

2. A computer system according k> claim i wherein each nuiltfr-oroeessor memory element <PME* has 2n 
processors, and communication paths which minimize delays due to chip crossings, 

2$ 3» A computer system according to claim 1 wherein each murfrprocesscr memory element (PME) has a 
processor, memory and routers wfthin a single chip and internal and external communication paths 
which minimize delays due to chip crossings* each processor memory element having means tor fixed 
and floating point processing, routing and I/O control. 

so 4. A computer system according to claim I further comprising within a processor memory element 

a native instruction set, means for providing an expandable multiply function, a programmer router for 
routing information alternatively teftfrfghi, 

NEWS mairR NEWS/up-down. hypercube. and wherein said programmable router is employs a 
hardwired distributed router provided by each processing memory element. 

38 

5. A computer computer system according to claim 1 organized as a massively parallel machine with 
nodes interconnected as a n dimensional neiv/ork cluster with parallel communication pafts between 
processor memory eternertfs along said internal and external communication paths providing a process- 
ing array* and wherein processing memory elements of an eney have a transparent mode utilized when 

<o routing data between processing memory elements within a chip set ol processing memory elements 
for permitting reduction of the effective network diameter of a network of nodes. 

6. A computer system according to claim i wherein a node of a processor array has muWpte stogie 
processor elements made up of 32K 16-bit words with a 16-bit processor for a network node of eight 

it processors with their, associated memory with their fully distributed I/O routers and signal I/O ports. 

7. A computer system according to claim 1 wherein a node of a processor array has muJtipte single 
processor elements made up of 32K i(H>it words with a i6-bit processor for a network node of eight 
processors with their associated memory with their fully distributed MO routers and signal VO porta 

30 combined as groups groups of node clusters organized as a 2d modified hypercube. 

& A computer system according to claim i wherein a node oi a processor array has muWpfe single 
processor elements made up of 02K 16-bil words with a 1 0-bit processor for a network node of eight 
processors with their associated memory with their fully distributed I/O routers end signal VO porta 
55 combined as groups groups of node clusters organized as a 2d modified hypercube> with up to 64 
clusters integrated in a network of node clusters to form are integrated to form a id modified 
hypercube of up to 32,760 processing memory elements. 



59 



EPO 570 729 A2 



9, A computer system according to claim 1 wherein a node processing memory etemsnt has Internal data 
flows using high speed hard registers to feed distributed ALU and and I/O router registers and logic for 
ail operation 

10l A computer system according to claim 1 wherein a node processing memory element In has an I/O 
port for oft cmp byte wide oomrnunicetion. and has input porta that are connected such thai data may 
be routed from input to memory, or from an input address register to an output register via a dlreoi 
parallel data path. 

11. A computer system according to claim 1 wherein a node has multiple processor memory elements and 
is connected to other nodes in a duster network with data routing distributed between hardware and 
software, with software controlling most oi the task sequencing function, 

12. A computer system according to claim i wherein a node has multiple processor memory elements and 
is connected to other nodss in a cluster network with date routing distributed between hardware and 
software* with hardware provided tor performing inner loop transfers and minimizing overhead on the 
outer roops of the node. 

13. A computer system according to claim 1 wherein a node has multiple processor memory elements and 
is connected to other nodes in a duster network with mo programs at dedicated interrupt levels *or 
managing the network, 

14* A computer system according to claim 1 wherein a node has multiple processor memory elements and 
is connected to other nodes in a cluster networii with VO programs at dedicated interrupt levels for 
managing the network, each processor memory element having interrupt registers and dedicating four 
interrupt levels to receiving data from four neighbors, a buiter provided at each level by loading 
registers ai the level, and having in and return instruction pairs using a buffer address and transfer 
count to enable the processor memory element to accept words from an input bus and to store them to 
the buffer, 

IS. A multi-processor memory system, comprising: a plurality of mufti-processor memory elements, each 
muta-processor memory etemsrri (PM6) having 2n processors, memory and routers within a single chip 
and internal and externa! communication petite which minimise deJaya due to chip crossings. 
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